An End-to-End Speech Synthesis Method Based on WaveNet


Abstract: In end-to-end speech synthesis systems, the Griffin-Lim algorithm is widely used to recover phase information, but the speech it produces has low fidelity and obvious artifacts. To address this problem, an end-to-end speech synthesis method based on the WaveNet network architecture is proposed. Built on a sequence-to-sequence (Seq2Seq) structure, the method first converts the input text into one-hot vectors, then uses an attention mechanism to obtain a Mel spectrogram, and finally uses a WaveNet back-end network to reconstruct the phase information of the speech signal, inverting the Mel spectrogram features into time-domain waveform samples. Experiments on English and Chinese used the LJSpeech-1.0 and THCHS-30 corpora, on which the method achieved mean opinion scores (MOS) of 3.31 and 3.02, respectively. In naturalness it outperforms both end-to-end systems based on the Griffin-Lim algorithm and parametric speech synthesis systems.
Key words: speech synthesis; end-to-end; Seq2Seq; Griffin-Lim algorithm; WaveNet
CLC number: TN912.33
Document code: A
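The first step named in the abstract, turning input text into one-hot vectors, can be illustrated with a minimal sketch. The function names `build_vocab` and `one_hot_encode` and the character-level symbol set are assumptions made for illustration; the paper's actual symbol inventory for English and Chinese is not given in this excerpt.

```python
def build_vocab(texts):
    """Map every distinct character in the given texts to an integer index."""
    symbols = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(symbols)}


def one_hot_encode(text, vocab):
    """Return one one-hot row (a list of 0/1 values) per character of `text`."""
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1  # exactly one position is set per character
        rows.append(row)
    return rows


# Hypothetical toy corpus; a real system would build the table from the
# full training transcripts.
vocab = build_vocab(["hello world"])
encoded = one_hot_encode("hello", vocab)
```

Each character becomes a vector as long as the symbol table, so supporting another language amounts to changing the symbol table rather than the network that consumes the vectors.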
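0 Introduction
Speech synthesis, also known as text-to-speech (TTS), is the technology by which a computer analyzes arbitrary text and converts it into fluent speech. As one of the core technologies of human-computer spoken interaction systems [1], it is an important direction in speech processing, and its practical value is receiving growing attention.
The dominant techniques in speech synthesis have changed over time. Concatenative synthesis, which splices together pre-recorded speech waveform segments, is one of the commonly used methods in the field [2-5]. Because it is limited by the content of the corpus, this method places high demands on the optimization of the concatenation algorithm and on the storage configuration, and it cannot be applied to speakers or text outside the corpus.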
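To make the idea of splicing pre-recorded segments concrete, the sketch below joins two waveform units with a linear cross-fade. This is only one simple joining strategy chosen for illustration; the cited unit-selection systems use their own selection costs and smoothing methods, and the function name `crossfade_concat` is hypothetical.

```python
def crossfade_concat(left, right, overlap):
    """Join two lists of samples, linearly cross-fading `overlap` samples."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be in 1..min(len(left), len(right))")
    head = left[:-overlap]   # part of the left unit that is kept unchanged
    tail = right[overlap:]   # part of the right unit that is kept unchanged
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand unit
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + tail


# Two made-up constant "units" so the fade is easy to see.
unit_a = [1.0] * 6
unit_b = [0.0] * 6
joined = crossfade_concat(unit_a, unit_b, overlap=3)
```

The blend hides the discontinuity at the boundary, but it cannot create speech that is not already in the recorded units, which is the limitation described above.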
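As statistical parametric speech synthesis matured, it was gradually adopted in the field [6]. Its basic idea is to decompose the input training speech into parameters, model the acoustic parameters, and build a parametric trained model and model library. Guided by this model library, the system then predicts the speech parameters of the text to be synthesized and feeds them into a vocoder to synthesize the target speech. This approach resolves the many boundary artifacts of concatenative synthesis. However, systems built this way require extensive domain expertise and are therefore difficult to design. Their modules are usually trained separately, so the errors of each module compound, and the generated speech is often muffled and unnatural compared with human speech.
With the rapid development of artificial intelligence, speech synthesis has gained new technical support. Deep learning can unify the internal modules into a single model that connects the input directly to the output, reducing the reliance on heavily engineered parametric models built on domain-specific knowledge. This technique is called "end-to-end" learning. An end-to-end speech synthesis system that can be trained on an annotated dataset of (text, speech) pairs brings several advantages. First, such a system can be conditioned on a variety of attributes, such as different speakers, different languages, or high-level features like semantics. Second, a single model is more robust than a multi-stage model in which errors compound.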
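In recent years, end-to-end speech synthesis systems have attracted extensive research. WaveNet [7] is a powerful generative model for speech that performs well in TTS, but its sample-level autoregressive nature makes it slow, and because it requires a complex front-end text analysis system, it is not an end-to-end speech synthesis system. Deep Voice [8] replaces each module of the traditional TTS pipeline with a neural network, but every module is trained separately, which makes it difficult to convert the system to an end-to-end one. Char2Wav [9] is an independently developed end-to-end model that can be trained on character data, but it requires traditional vocoder parameters as an intermediate feature representation and cannot directly predict spectral features. Tacotron [10] is a sequence-to-sequence (Seq2Seq) architecture that generates magnitude spectrograms from character sequences. It trains a single neural network from the input data alone to replace the modules that generate linguistic and acoustic features, estimates the phase with the Griffin-Lim algorithm [11], and applies the inverse short-time Fourier transform to synthesize speech, which simplifies the traditional speech synthesis pipeline. However, the Griffin-Lim algorithm produces characteristic artifacts and low-fidelity speech, so it needs to be replaced by a neural network architecture.
To address the low naturalness of speech recovered by the Griffin-Lim algorithm in current end-to-end systems, this paper proposes an end-to-end speech synthesis method based on the WaveNet network architecture. An attention-based Seq2Seq architecture serves as the feature prediction network that converts the input text into a Mel spectrogram, and it is combined with the WaveNet architecture to achieve multilingual speech synthesis.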
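The remark that sample-level autoregression makes WaveNet slow can be illustrated with a toy sketch of stacked dilated causal convolutions. This is a deliberately simplified stand-in, not the WaveNet of [7] or the network used in this paper: it has no gated activations, residual or skip connections, Mel-spectrogram conditioning, or sampling from a predicted distribution, and the weights, dilations, and seed value are arbitrary assumptions.

```python
import math


def causal_dilated_layer(signal, weights, dilation):
    """Two-tap causal convolution: y[t] = tanh(w_past * x[t - d] + w_now * x[t]).

    Positions before the start of the signal are treated as zeros, so y[t]
    never depends on samples later than t.
    """
    w_past, w_now = weights
    out = []
    for t, x_now in enumerate(signal):
        x_past = signal[t - dilation] if t >= dilation else 0.0
        out.append(math.tanh(w_past * x_past + w_now * x_now))
    return out


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def generate(num_samples, dilations, weights):
    """Generate samples one at a time; each step reruns the stack on the history."""
    samples = [1.0]  # hypothetical seed sample
    for _ in range(num_samples - 1):
        hidden = samples
        for d in dilations:
            hidden = causal_dilated_layer(hidden, weights, d)
        samples.append(hidden[-1])  # toy "prediction" for the next time step
    return samples


DILATIONS = [1, 2, 4, 8]
field = receptive_field(DILATIONS)  # 1 + (1 + 2 + 4 + 8) = 16
waveform = generate(32, DILATIONS, weights=(0.5, 0.5))
```

Doubling the dilation at each layer lets the receptive field grow quickly with depth, but the loop in `generate` shows the cost: every output sample has to wait for the previous one, so generation cannot be parallelized over time.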
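4 Conclusion
This paper presented an end-to-end speech synthesis system. A feature prediction network is first trained with an attention-based Seq2Seq model to obtain the Mel spectrogram of the speech to be synthesized, and the WaveNet architecture then recovers the lost phase information to synthesize the speech. In the experiments, the system using the WaveNet architecture outperformed the system that used the Griffin-Lim algorithm as the waveform converter. System performance improved as the number of training steps increased and leveled off after 200k iterations. Adjusting the character representation makes it possible to synthesize different languages. Because the feature representation and prosodic structure of Chinese are more complex, the naturalness of the synthesized Chinese speech is lower than that of the English speech.
The Seq2Seq architecture used in these experiments consists mainly of combinations of RNNs. Future work will examine how other network combinations affect synthesis quality. Modifying the WaveNet structure to speed up convergence is also a topic worth studying.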
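The naturalness comparison is reported in the abstract as a mean opinion score (MOS). The sketch below shows how such a score is commonly computed from listener ratings. The 1 to 5 rating scale, the normal-approximation 95% confidence interval, and the ratings themselves are assumptions for illustration; they are not the paper's listening-test protocol or data.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up listener scores for one system, not results from the paper.
ratings = [4, 3, 3, 4, 2, 3, 4, 3]
mos, ci = mean_opinion_score(ratings)
```

Reporting the interval alongside the mean helps show whether a gap between two systems is larger than the spread among listeners.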
References
[1] FUNG P, SCHULTZ T. Multilingual spoken language processing [J]. IEEE Signal Processing Magazine, 2008, 25(3): 89-97.
[2] HUNT A J, BLACK A W. Unit selection in a concatenative speech synthesis system using a large speech database [C]// Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1996: 373-376.
[3] CAMPBELL N, BLACK A W. Prosody and the selection of source units for concatenative synthesis [M]// Progress in Speech Synthesis. New York: Springer, 1997: 279-292.
[4] ZEN H, SENIOR A, SCHUSTER M. Statistical parametric speech synthesis using deep neural networks [C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 7962-7966.
[5] TOKUDA K, NANKAKU Y, TODA T, et al. Speech synthesis based on hidden Markov models [J]. Proceedings of the IEEE, 2013, 101(5): 1234-1252.
[6] ZEN H, TOKUDA K, BLACK A W. Statistical parametric speech synthesis [J]. Speech Communication, 2009, 51(11): 1039-1064.
[7] OORD A V D, DIELEMAN S, ZEN H, et al. WaveNet: a generative model for raw audio [J/OL]. arXiv Preprint, 2016, 2016: arXiv:1609.03499 (2016-09-12) [2016-09-19]. https://arxiv.org/abs/1609.03499.
[8] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: real-time neural text-to-speech [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1702.07825 (2017-02-25) [2017-03-07]. https://arxiv.org/abs/1702.07825.
[9] SOTELO J, MEHRI S, KUMAR K, et al. Char2Wav: end-to-end speech synthesis [EB/OL]. [2018-06-20]. http://mila.umontreal.ca/wpcontent/uploads/2017/02/endendspeech.pdf.
[10] WANG Y, SKERRY-RYAN R, STANTON D, et al. Tacotron: towards end-to-end speech synthesis [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1703.10135 (2017-03-29) [2017-04-06]. https://arxiv.org/abs/1703.10135.
[11] GRIFFIN D, LIM J S. Signal estimation from modified short-time Fourier transform [J]. IEEE Transactions on Acoustics Speech and Signal Processing, 1984, 32(2): 236-243.
[12] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al. Attention-based models for speech recognition [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 577-585.
[13] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4945-4949.
[14] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4960-4964.
[15] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156-3164.
[16] VINYALS O, KAISER L, KOO T, et al. Grammar as a foreign language [C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 2773-2781.
[17] LEE J, CHO K, HOFMANN T. Fully character-level neural machine translation without explicit segmentation [J/OL]. arXiv Preprint, 2017, 2017: arXiv:1610.03017 (2016-10-10) [2017-05-13]. https://arxiv.org/abs/1610.03017.
[18] SRIVASTAVA R K, GREFF K, SCHMIDHUBER J. Highway networks [J/OL]. arXiv Preprint, 2015, 2015: arXiv:1505.00387 (2015-03-03) [2015-11-03]. https://arxiv.org/abs/1505.00387.
[19] ERRO D, SAINZ I, NAVAS E, et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis [J]. IEEE Journal of Selected Topics in Signal Processing, 2014, 8(2): 184-194.
[20] AOKI N. Development of a rule-based speech synthesis system for the Japanese language using a MELP vocoder [C]// Proceedings of the 2000 10th European Signal Processing Conference. Piscataway, NJ: IEEE, 2000: 1-4.
[21] GUNDUZHAN E, MOMTAHAN K. Linear prediction based packet loss concealment algorithm for PCM coded speech [J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(8): 778-785.
