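[Paper Digest] FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech is a non-autoregressive, Transformer-based TTS model. It generates mel-spectrograms in parallel using a duration predictor and a length regulator, which greatly speeds up inference, improves robustness, and makes voice speed controllable.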
Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x. Therefore, we call our model FastSpeech.
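Motivation and Goals
- Address the slow inference, poor robustness (word skipping and repetition), and lack of controllability of autoregressive TTS models.
- Propose a framework based on a feed-forward Transformer (FFT) that generates mel-spectrograms in parallel.
- Align phonemes with mel frames through phoneme durations: a duration predictor and a length regulator make the expanded phoneme sequence match the mel-spectrogram length.
- Enable controllable synthesis by adjusting phoneme durations to change voice speed and prosody.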
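Proposed Method
- Convert phonemes to mel-spectrograms with a feed-forward Transformer (FFT) built from blocks of self-attention and 1D convolution.
- Introduce a length regulator that upsamples the phoneme hidden representations according to the predicted phoneme durations, so that their length matches the mel-spectrogram length.
- Predict phoneme durations with a duration predictor trained with the help of an autoregressive teacher model; the ground-truth durations are derived from the teacher's diagonal attention alignments.
- Train FastSpeech with sequence-level knowledge distillation from the autoregressive Transformer TTS model (teacher) to the parallel model (student).
- Use the WaveGlow vocoder to synthesize audio from the generated mel-spectrograms, completing the end-to-end pipeline.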
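The duration extraction and length regulation steps in the list above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the function names `durations_from_attention` and `length_regulate`, the toy attention matrix, and the choice to assign each mel frame to its most-attended phoneme are assumptions made for the example.

```python
from typing import List, Sequence


def durations_from_attention(attn: Sequence[Sequence[float]]) -> List[int]:
    """Count, for each phoneme, how many mel frames attend to it most.

    attn[s][j] is the teacher's attention weight of mel frame s on phoneme j.
    (Assumed layout for this sketch.)
    """
    num_phonemes = len(attn[0])
    durations = [0] * num_phonemes
    for frame_weights in attn:
        # Assign the frame to the phoneme with the largest attention weight.
        best = max(range(num_phonemes), key=lambda j: frame_weights[j])
        durations[best] += 1
    return durations


def length_regulate(
    hidden: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each phoneme's hidden vector `duration` times."""
    expanded: List[List[float]] = []
    for vec, d in zip(hidden, durations):
        # Copy the vector d times so the output has one row per mel frame.
        expanded.extend(list(vec) for _ in range(d))
    return expanded


# Toy example: 5 mel frames attending over 3 phonemes (roughly diagonal).
attn = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
hidden = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

durations = durations_from_attention(attn)
expanded = length_regulate(hidden, durations)
print(durations)      # [2, 1, 2]
print(len(expanded))  # 5, one row per mel frame
```

Because the expanded sequence already has one entry per mel frame, a decoder can process all frames at once instead of generating them one by one, which is where the parallel speedup comes from.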
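Experimental Results
Research Questions
- RQ1: Can non-autoregressive, parallel mel-spectrogram generation reach speech quality comparable to autoregressive models?
- RQ2: Do the length regulator and accurate phoneme duration prediction reduce word skipping and repetition errors?
- RQ3: How much speedup over autoregressive TTS can be achieved in mel-spectrogram generation and in end-to-end synthesis?
- RQ4: To what extent can voice speed and prosody be controlled through phoneme durations?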
Key Findings
| Method | MOS (mean ± CI) | Notes |
|---|---|---|
| GT | 4.41 ± 0.08 | Ground-truth audio |
| GT (Mel + WaveGlow) | 4.00 ± 0.09 | Ground-truth mel-spectrogram + WaveGlow |
| Tacotron 2 (Mel + WaveGlow) | 3.86 ± 0.09 | Autoregressive TTS baseline |
| Merlin (WORLD) | 2.40 ± 0.13 | Parametric TTS |
| Transformer TTS (Mel + WaveGlow) | 3.88 ± 0.09 | Autoregressive Transformer TTS |
| FastSpeech (Mel + WaveGlow) | 3.84 ± 0.08 | Proposed model |
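- FastSpeech nearly matches autoregressive Transformer TTS in speech quality (MOS 3.84 vs. 3.88).
- Compared with autoregressive Transformer TTS, mel-spectrogram generation is 269.4x faster and end-to-end synthesis is 38.3x faster.
- On hard test sentences, FastSpeech nearly eliminates word skipping and repetition (0% error rate).
- Voice speed can be adjusted smoothly between 0.5x and 1.5x by scaling phoneme durations.
- Adding pauses between words through duration control improves prosody.
- Ablations show that both 1D convolution and sequence-level knowledge distillation contribute positively to performance.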
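The two controllability findings above both come down to editing the duration sequence before length regulation. The sketch below illustrates that idea under stated assumptions: the helper names `scale_durations` and `add_pauses`, the convention that `speed` divides the durations, the half-up rounding, and the use of a `" "` token to mark word boundaries are all hypothetical choices for this example, not details confirmed by the digest.

```python
from typing import List, Sequence


def scale_durations(durations: Sequence[int], speed: float) -> List[int]:
    """Shorten or lengthen every phoneme to change voice speed.

    speed > 1 gives faster speech (fewer frames), speed < 1 slower speech.
    Rounding half up is an assumption of this sketch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [int(d / speed + 0.5) for d in durations]


def add_pauses(
    tokens: Sequence[str],
    durations: Sequence[int],
    extra_frames: int,
    boundary: str = " ",
) -> List[int]:
    """Lengthen only the word-boundary tokens to insert pauses between words."""
    return [
        d + extra_frames if tok == boundary else d
        for tok, d in zip(tokens, durations)
    ]


# Toy phoneme sequence with one word boundary and per-token frame counts.
tokens = ["HH", "AY", " ", "DH", "EH", "R"]
durations = [3, 4, 1, 2, 5, 4]

slow = scale_durations(durations, speed=0.5)
paused = add_pauses(tokens, durations, extra_frames=5)
print(slow)    # [6, 8, 2, 4, 10, 8]
print(paused)  # [3, 4, 6, 2, 5, 4]
```

Either modified sequence can then be passed to the length regulator in place of the predicted durations, so no retraining is needed to change speed or pausing.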
This digest was generated by AI and reviewed by a human editor.