QUICK REVIEW

[Paper Review] Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Jonathan Shen, Jia Ye|arXiv (Cornell University)|Oct 8, 2020

Neural Networks and Applications | References: 52 | Cited by: 73

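One-Sentence Summary

This paper replaces the attention mechanism of Tacotron 2 with an explicit duration predictor and Gaussian upsampling, yielding robust and controllable TTS that supports supervised, semi-supervised, or unsupervised duration modeling.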

ABSTRACT

This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model, replacing the attention mechanism with an explicit duration predictor. This improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2. The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time. When accurate target durations are scarce or unavailable in the training data, we propose a method using a fine-grained variational auto-encoder to train the duration predictor in a semi-supervised or unsupervised manner, with results almost as good as supervised training.

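Research Motivation and Goals

Address the robustness problems of attention-based neural TTS and reduce the risk of failures such as repetition or long pauses.
Introduce Non-Attentive Tacotron (NAT), which replaces attention with a duration predictor and Gaussian upsampling.
Enable training with supervised, semi-supervised, or unsupervised duration information through alignment based on a fine-grained variational auto-encoder (FVAE).
Provide a way to control the speaking rate and the timing of each phoneme at inference time while preserving audio quality.
Propose automated metrics for large-scale robustness evaluation: unaligned duration ratio (UDR) and word deletion rate (WDR).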

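Proposed Method

Replace the attention mechanism of Tacotron 2 with a duration predictor and Gaussian upsampling, which upsample the encoder outputs.
Predict for each token a duration d and a range parameter sigma used by the Gaussian upsampling.
Upsample the encoder outputs with a mixture of Gaussians centered on each token's segment, producing an input that is time-aligned with the decoder.
Train with a loss that combines a mel-spectrogram reconstruction loss and a duration prediction loss (L_spec and L_dur).
Use an FVAE to extract token-aligned latent features from the target spectrogram, which supports semi-supervised and unsupervised duration modeling.
Control the overall pace of an utterance and the timing of individual phonemes at inference time by manipulating the predicted durations, without losing quality.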
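The upsampling step described above can be illustrated with a minimal NumPy sketch. It assumes each token's Gaussian is centered at the midpoint of its duration segment and that frame t is evaluated at t + 0.5; the function name `gaussian_upsample`, the frame-midpoint convention, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_upsample(h, durations, sigmas):
    """Upsample token-level encoder outputs to frame level.

    h:         (N, D) encoder outputs, one row per token
    durations: (N,)   token durations in frames
    sigmas:    (N,)   positive range parameters, one per token
    Returns (u, w): frame-level features (T, D) and weights (T, N).
    """
    h = np.asarray(h, dtype=float)
    d = np.asarray(durations, dtype=float)
    s = np.asarray(sigmas, dtype=float)

    ends = np.cumsum(d)
    centers = ends - d / 2.0              # midpoint of each token's segment
    num_frames = int(round(ends[-1]))
    t = np.arange(num_frames) + 0.5       # assumption: evaluate at frame midpoints

    # Log density of N(t; center_i, sigma_i^2) up to a constant, shape (T, N).
    z = (t[:, None] - centers[None, :]) / s[None, :]
    log_p = -0.5 * z ** 2 - np.log(s)[None, :]

    # Normalize over tokens so every frame's weights sum to one.
    log_p -= log_p.max(axis=1, keepdims=True)
    w = np.exp(log_p)
    w /= w.sum(axis=1, keepdims=True)
    return w @ h, w

# Toy usage: three tokens with one-hot features, two frames each.
u, w = gaussian_upsample(np.eye(3), [2, 2, 2], [0.5, 0.5, 0.5])
```

Unlike repeating each encoder output d times, the weights vary smoothly across token boundaries, and a larger sigma lets a token influence more of its neighbors' frames.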

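Experimental Results

Research Questions

RQ1: Can explicit duration modeling with a duration predictor and Gaussian upsampling be more robust than attention-based Tacotron 2?
RQ2: How does unsupervised or semi-supervised duration modeling compare with fully supervised training in naturalness and robustness?
RQ3: How far can NAT's pace be controlled, both globally and per phoneme, without degrading quality?
RQ4: Beyond MOS, which metrics (such as UDR and WDR) are effective for evaluating robustness at scale?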

Main Findings

Model	LibriTTS UDR (%)	LibriTTS WDR (%)	web-long UDR (%)	web-long WDR (%)
Tacotron 2 w/ LSA	16.96	0.4	46.04	4.4
Tacotron 2 w/ GMMA	3.812	0.1	6.157	1.3
Non-Attentive Tacotron Supervised	0.005	0.1	0.011	1.0
Non-Attentive Tacotron Semi-supervised	0.034	0.3	0.035	1.7
Non-Attentive Tacotron Unsupervised	0.181	0.4	0.291	1.9
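To make the two column types in the table concrete, here is a rough sketch of how UDR and WDR could be computed from the output of a speech recognizer. This review does not spell out the exact definitions, so the following are assumptions: UDR is taken as the share of audio lying in long stretches (longer than a hypothetical `min_gap` of 1 second) that are not aligned to any input word, and WDR as the share of input words deleted in a minimum-edit alignment against the ASR transcript. Both function names and the input formats are hypothetical.

```python
def unaligned_duration_ratio(word_spans, total_duration, min_gap=1.0):
    """word_spans: (start, end) times in seconds for each aligned word."""
    unaligned = 0.0
    cursor = 0.0
    for start, end in sorted(word_spans):
        gap = start - cursor
        if gap > min_gap:          # only long unaligned stretches count
            unaligned += gap
        cursor = max(cursor, end)
    tail = total_duration - cursor
    if tail > min_gap:
        unaligned += tail
    return unaligned / total_duration

def word_deletion_rate(reference, hypothesis):
    """Fraction of reference words deleted in a minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0
    # Each cell holds (edit_cost, deletions) for ref[:i] versus hyp[:j].
    prev = [(j, 0) for j in range(len(hyp) + 1)]   # empty ref: insertions only
    for i in range(1, len(ref) + 1):
        cur = [(i, i)]                             # empty hyp: all words deleted
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            match_or_sub = (prev[j - 1][0] + sub, prev[j - 1][1])
            deletion = (prev[j][0] + 1, prev[j][1] + 1)
            insertion = (cur[j - 1][0] + 1, cur[j - 1][1])
            cur.append(min(match_or_sub, deletion, insertion))
        prev = cur
    return prev[-1][1] / len(ref)

# Toy usage with made-up alignments and transcripts.
udr = unaligned_duration_ratio([(0.0, 0.5), (0.6, 1.0), (3.0, 3.5)], 4.0)
wdr = word_deletion_rate("the cat sat on the mat", "the cat on the mat")
```

In this toy case the two-second silence between the second and third word counts as unaligned, and the missing word "sat" counts as one deletion out of six reference words.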

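NAT with Gaussian upsampling reaches naturalness comparable to Tacotron 2 (GMMA) in MOS tests.
Gaussian upsampling is markedly more robust than vanilla upsampling and the attention-based baselines.
Supervised NAT is highly robust (low UDR and WDR), with a MOS close to that of ground-truth speech.
Semi-supervised and unsupervised duration modeling with the FVAE retains most of the naturalness and robustness and outperforms purely unsupervised training without the FVAE.
The autoregressive decoder remains important for high-quality synthesis; a non-autoregressive decoder still falls short of NAT in naturalness.
NAT allows both global and fine-grained pace control at inference time, with no loss of quality in the supervised setting.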
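The pace control mentioned in the last finding amounts to rescaling the predicted durations before upsampling. The sketch below shows one way this could look; the function name `scale_durations`, the `pace` convention (values above 1 speed speech up), and the cumulative rounding scheme are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def scale_durations(durations, pace=1.0, per_token_scale=None):
    """Rescale predicted token durations (in frames) at inference time.

    pace:            global factor; pace > 1 shortens every token
    per_token_scale: optional (N,) multipliers for individual tokens
    Returns integer frame counts per token.
    """
    d = np.asarray(durations, dtype=float)
    if per_token_scale is not None:
        d = d * np.asarray(per_token_scale, dtype=float)
    d = d / pace
    # Round the cumulative end times rather than each duration, so
    # rounding errors do not accumulate over a long utterance.
    ends = np.rint(np.cumsum(d)).astype(int)
    return np.diff(ends, prepend=0)

# Toy usage: halve the pace, then stretch only the second token.
slow = scale_durations([2, 3, 1], pace=0.5)
emphasized = scale_durations([2, 3, 1], per_token_scale=[1, 2, 1])
```

Because the decoder only sees the upsampled encoder outputs, editing durations this way changes timing without touching the rest of the model.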

This review was generated by AI and reviewed by human editors.