QUICK REVIEW

[Paper Review] Neural Speech Synthesis with Transformer Network

Naihan Li, Shujie Liu|arXiv (Cornell University)|Sep 19, 2018

Speech Recognition and Synthesis | Cited by 39

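One-Sentence Summary

The paper proposes a Transformer-based end-to-end text-to-speech (TTS) model that replaces the recurrent neural networks (RNNs) in Tacotron2 with multi-head self-attention, which enables parallel training and improves the modeling of long-range dependencies. The model reaches a mean opinion score (MOS) of 4.39, very close to human speech quality (4.44), and trains about 4.25 times faster than Tacotron2.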

ABSTRACT

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by self-attention mechanism, which solves the long range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For the efficiency, our Transformer TTS network can speed up the training about 4.25 times faster compared with Tacotron2. For the performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms Tacotron2 with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
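The abstract's claim that any two inputs are connected directly can be illustrated with a minimal sketch of single-head scaled dot-product self-attention. This is not the paper's implementation: the function names `softmax` and `self_attention` are hypothetical, the learned query, key, and value projections are omitted, and only the Python standard library is used. Each output row depends only on the inputs, not on the previous row's result, which is why a real implementation can compute all rows in parallel.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention (illustrative only).

    x: list of T vectors, each a list of d floats.
    Assumption: queries, keys, and values are all x itself; a real layer
    would first apply learned projections, which are omitted here.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, q in enumerate(x):
        scores = []
        for j, k in enumerate(x):
            if causal and j > i:
                # Decoder-style mask: position i may not look at the future.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        # Weighted sum of all value vectors, one component at a time.
        out = [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Hypothetical toy input: three time steps, two features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(frames)
out_causal, attn_causal = self_attention(frames, causal=True)
```

In the unmasked case, `attn[0][2]` is nonzero, so the first step reaches the last step in a single operation, whereas an RNN would have to pass information through every intermediate state. With `causal=True`, each step attends only to itself and earlier steps, as a decoder requires.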

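Motivation and Objectives

Address the limitations of RNN-based TTS models such as Tacotron2 in efficiency and in modeling long-range dependencies.
Adapt the Transformer architecture, originally designed for machine translation, to end-to-end text-to-speech synthesis.
Speed up training by computing the encoder and decoder hidden states in parallel.
Model long-range dependencies with self-attention to improve prosody and audio quality.
Build an end-to-end TTS system with phoneme input and a WaveNet vocoder that reaches near-human speech quality.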

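Proposed Method

Replace the RNN-based encoder and decoder of Tacotron2 with multi-head self-attention so that hidden states can be computed in parallel.
Use multi-head self-attention in both the encoder and the decoder to capture long-range dependencies without sequential recurrence.
Take phoneme sequences as input and generate mel spectrograms end to end, then synthesize the waveform with a WaveNet vocoder.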
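Adapt the standard Transformer architecture with a scaled positional encoding (sinusoidal encodings multiplied by a trainable weight), so that the positional signal can match the scale of the prenet outputs it is added to.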
Apply residual connections and layer normalization to stabilize training and improve gradient flow.
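Train within a sequence-to-sequence framework, following Tacotron2: a regression loss on the predicted mel spectrograms and a binary cross-entropy loss on the stop-token prediction.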
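The paper describes its positional encoding as the standard sinusoidal encoding multiplied by a trainable weight before it is added to the input embeddings. A minimal sketch of that idea follows, assuming plain Python lists and a fixed `alpha` standing in for the trainable weight; the function names `sinusoidal_pe` and `add_scaled_pe` are hypothetical, not taken from the paper.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Standard Transformer sinusoidal encoding for one position.

    Even dimensions use sine and odd dimensions use cosine; each
    sine/cosine pair shares one frequency.
    """
    pe = []
    for k in range(d_model):
        angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def add_scaled_pe(embeddings, alpha):
    """Return embeddings[pos] + alpha * PE(pos) for every position.

    alpha is a plain float here; in a trained model it would be a
    learned parameter (an assumption of this sketch, not shown).
    """
    d_model = len(embeddings[0])
    return [
        [e + alpha * p for e, p in zip(vec, sinusoidal_pe(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# Hypothetical toy input: one position, four features, all zero.
encoded = add_scaled_pe([[0.0, 0.0, 0.0, 0.0]], alpha=0.5)
```

With `alpha` fixed at 1.0 this reduces to the original Transformer encoding; letting the weight vary allows the positional signal to be tuned to the magnitude of the embeddings it is added to.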

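Experimental Results

Research Questions

RQ1: Can the Transformer architecture effectively replace RNNs in TTS and improve training efficiency?
RQ2: Does multi-head self-attention in the encoder and decoder improve the modeling of long-range dependencies in speech sequences?
RQ3: Can a Transformer-based TTS model reach near-human speech quality compared with Tacotron2?
RQ4: How does the training speed of the proposed model compare with that of Tacotron2?
RQ5: Which hyperparameters (such as the number of layers and heads) most affect model performance and stability?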

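Key Findings

The proposed Transformer TTS model reaches a mean opinion score (MOS) of 4.39, very close to the human reference (4.44).
In comparative MOS (CMOS), the model outperforms Tacotron2 by 0.048, which the authors report as state-of-the-art performance.
Because hidden states are computed in parallel, training is about 4.25 times faster than with Tacotron2.
Increasing the number of layers (for example, from 3 to 6) helps the model reproduce the high-frequency regions of the mel spectrogram and improves audio quality.
Batch size is a key factor in training stability, especially for deeper models.
Self-attention mitigates the long-range dependency problem by directly connecting any two time steps.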
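For readers unfamiliar with the metric, a MOS is the mean of listener ratings on a 1 to 5 scale, commonly reported with a confidence interval. The sketch below shows that arithmetic under stated assumptions: the ratings are hypothetical and not taken from the paper, the function name `mos_with_ci` is illustrative, and the interval uses a normal approximation.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-to-5 listener ratings.

    half_width is the normal-approximation 95% confidence half-width
    when z = 1.96; it uses the sample standard deviation.
    """
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings from eight listeners (illustration only).
hypothetical_ratings = [4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 5.0, 4.0]
mos, ci = mos_with_ci(hypothetical_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

A CMOS is computed the same way, except that each rating is a signed preference between two systems heard side by side, so a small positive mean such as 0.048 indicates a slight average preference for one system.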


This review was generated by AI and reviewed by human editors.