QUICK REVIEW

[Paper Review] Close to Human Quality TTS with Transformer

Naihan Li, Shujie Liu | arXiv (Cornell University) | Sep 19, 2018

Natural Language Processing Techniques | References: 16 | Cited by: 85

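One-Sentence Summary

This paper proposes a Transformer-based TTS model that replaces the RNNs and the attention mechanism of Tacotron2 with multi-head self-attention, which makes training about 4.25 times faster and models long-range dependencies effectively. In human evaluation its speech quality approaches human level: it reaches a MOS of 4.39 against 4.44 for human recordings, and it outperforms Tacotron2 by a gap of 0.048.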

ABSTRACT

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) are proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs). Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by self-attention mechanism, which solves the long range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For the efficiency, our Transformer TTS network can speed up the training about 4.25 times faster compared with Tacotron2. For the performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforms Tacotron2 with a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).

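Research Motivation and Goals

Address the low training and inference efficiency of end-to-end TTS models such as Tacotron2.
Overcome the limitations of RNNs in modeling long-range dependencies in sequential TTS data.
Improve speech quality by replacing the RNNs and the original attention mechanism in the encoder and decoder with multi-head self-attention.
Achieve state-of-the-art text-to-speech performance with quality close to human level.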

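Proposed Method

Replace the RNN-based encoder and decoder of Tacotron2 with Transformer encoder and decoder blocks built on multi-head self-attention.
Compute context representations in parallel with multi-head self-attention, removing sequential recurrence and speeding up training.
Connect the representations of any two time steps directly through self-attention, so that long-range dependencies are modeled effectively.
Take phoneme sequences as input and generate mel spectrograms, which a WaveNet vocoder then converts to raw audio.
Adapt the Transformer architecture to TTS by adjusting the positional encoding and the attention mechanism so that it supports autoregressive generation.
Train end to end with a combination of L1 and L2 losses on the predicted mel spectrograms.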
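The parallel computation and the direct connection between time steps described above can be illustrated with a minimal sketch of scaled dot-product self-attention. This is not the paper's implementation: it uses a single head, identity query/key/value projections, and made-up input values, all of which are assumptions chosen to keep the example short and dependency-free. The `causal` flag is a hypothetical parameter that mimics the masking an autoregressive decoder needs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over T vectors.

    Assumption of this sketch: queries, keys, and values are the inputs
    themselves (identity projections, one head). A real model learns
    separate projections for each of several heads.
    """
    steps, dim = len(x), len(x[0])
    outputs, weights = [], []
    # Each iteration depends only on x, never on a previous iteration's
    # output, so all steps could be computed at once on parallel hardware.
    for t, query in enumerate(x):
        # Decoder-style masking: step t may only attend to steps 0..t.
        visible = x[: t + 1] if causal else x
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        # Masked (future) positions get exactly zero weight.
        row = softmax(scores) + [0.0] * (steps - len(visible))
        weights.append(row)
        outputs.append(
            [sum(row[j] * x[j][i] for j in range(steps)) for i in range(dim)]
        )
    return outputs, weights


# Four toy "frames" of dimension 2 (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
enc_out, enc_weights = self_attention(frames)               # encoder-style
dec_out, dec_weights = self_attention(frames, causal=True)  # decoder-style
```

In the encoder-style call the first frame assigns nonzero weight to the last frame in a single step, whereas an RNN would have to carry that information through every intermediate state. In the causal call every weight that points at a future frame is zero, which is what lets the decoder generate frames one after another at inference time.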

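Experimental Results

Research Questions

RQ1: Can replacing RNNs with self-attention in TTS improve training efficiency without sacrificing performance?
RQ2: How well does self-attention model long-range dependencies in TTS sequences compared with RNNs?
RQ3: Can a Transformer-based TTS model reach speech quality close to human level in subjective human evaluation?
RQ4: What is the quantitative improvement in MOS (mean opinion score) over Tacotron2?
RQ5: How much faster is training compared with Tacotron2?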

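Key Findings

The proposed Transformer TTS model trains about 4.25 times faster than Tacotron2.
In human evaluation the model achieves a mean opinion score (MOS) of 4.39 and outperforms Tacotron2 by a gap of 0.048.
The MOS of 4.39 is very close to the 4.44 scored by human recordings, indicating speech quality near human level.
Because self-attention connects all time steps directly, the model captures long-range dependencies effectively.
Multi-head self-attention lets the hidden states be computed in parallel, which markedly improves training efficiency.
The WaveNet vocoder converts the generated mel spectrograms into high-fidelity audio and contributes substantially to the high perceived quality.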
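To put the headline numbers in perspective, the short calculation below works through what a 4.25x speed-up and a 4.39 vs 4.44 MOS mean in practice. The three constants come from the abstract; the 42.5-hour baseline is a hypothetical figure chosen only for illustration, and dividing total training time by the speed-up assumes both models train for the same number of steps.

```python
SPEEDUP = 4.25      # training speed-up over Tacotron2, from the abstract
MOS_MODEL = 4.39    # Transformer TTS, from the abstract
MOS_HUMAN = 4.44    # human recordings, from the abstract


def training_hours(baseline_hours, speedup=SPEEDUP):
    """Training time left after applying the reported speed-up factor."""
    return baseline_hours / speedup


def mos_gap(reference, model):
    """Absolute MOS gap, rounded to the two decimals the scores use."""
    return round(reference - model, 2)


# Hypothetical baseline: a Tacotron2 run that takes 42.5 hours.
hours = training_hours(42.5)
gap = mos_gap(MOS_HUMAN, MOS_MODEL)
relative_gap = gap / MOS_HUMAN

print(f"Transformer TTS training time: {hours:.1f} h")
print(f"MOS gap to human recordings: {gap:.2f} ({relative_gap:.1%})")
```

Under these assumptions the hypothetical run shrinks from 42.5 to 10 hours, and the remaining gap to human recordings is 0.05 MOS points, roughly 1% of the human score.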

This review was generated by AI and checked by human editors.