QUICK REVIEW

[Paper Review] NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Xu Tan, Jiawei Chen | arXiv (Cornell University) | May 9, 2022

Speech Recognition and Synthesis | Cited by 35

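One-Sentence Summary

NaturalSpeech is an end-to-end TTS system that combines a VAE-based text-to-waveform framework with phoneme pre-training, differentiable duration modeling, a bidirectional prior/posterior flow, and a memory-based VAE to reach human-level quality on LJSpeech, where its CMOS shows no statistically significant difference from human recordings.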

ABSTRACT

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise that whether a TTS system can achieve human-level quality, how to define/judge that quality and how to achieve it. In this paper, we answer these questions by first defining the human-level quality based on the statistical significance of subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in VAE. Experiment evaluations on popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) to human recordings at the sentence level, with Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.

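Motivation and Goals

Define human-level quality for TTS in terms of statistical significance in subjective evaluation.
Establish guidelines for judging human-level quality on a test set.
Develop an end-to-end TTS system that closes the gap to human recordings on a benchmark dataset.
Show that the proposed system is indistinguishable from human speech on LJSpeech as measured by CMOS.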

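Proposed Method

Use a variational autoencoder to map speech x to frame-level latent variables z and to reconstruct x from z via p(x|z).
Use a phoneme encoder with large-scale pre-training to predict the frame-level prior p(z|y) from text y.
Introduce a differentiable durator that aligns the phoneme-level prior with the frame-level posterior.
Add a flow-based bidirectional prior/posterior module to strengthen the prior and simplify the posterior.
Apply a memory-based VAE that reconstructs the waveform by attending over a memory bank, which reduces the complexity of the posterior.
Train end to end with several loss terms (L_bwd, L_fwd, L_rec, L_e2e), with soft-DTW applied where appropriate.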
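
The first two steps above describe a conditional VAE: a posterior over z inferred from speech x, a decoder p(x|z), and a text-conditioned prior p(z|y). As a reading aid, the generic evidence lower bound for such a model is written out below. This is the textbook form, not the paper's exact objective. The symbol q(z|x) for the posterior is notation assumed here, and how the bound relates to L_bwd, L_fwd, L_rec, and L_e2e is defined in the paper rather than in this review.

```latex
\log p(x \mid y) \;\ge\;
  \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid y)\big)
```

Read this way, the KL term shrinks when the prior p(z|y) is expressive enough to match the posterior q(z|x), which is consistent with the stated aim of strengthening the prior and simplifying the posterior.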

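Experimental Results

Research Questions

RQ1: What constitutes human-level quality in TTS, and how can it be judged statistically?
RQ2: Can an end-to-end TTS system approach or match human recordings on a standard dataset?
RQ3: Which architectural components (phoneme pre-training, differentiable durator, bidirectional prior/posterior, memory-based VAE) contribute most to closing the gap to human speech?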

Key Findings

System	MOS	CMOS
Human Recordings	4.58±0.13	0
NaturalSpeech	4.56±0.13	-0.01
FastSpeech 2 + HiFiGAN	4.32±0.15	-0.33
Glow-TTS + HiFiGAN	4.34±0.13	-0.26
Grad-TTS + HiFiGAN	4.37±0.13	-0.24
VITS	4.43±0.13	-0.20

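On LJSpeech, NaturalSpeech achieves -0.01 CMOS relative to human recordings with p >> 0.05, indicating no statistically significant difference from human speech.
On MOS, NaturalSpeech matches human recordings (4.56±0.13 vs 4.58±0.13, p = 0.7145).
Compared with FastSpeech 2 + HiFiGAN, Glow-TTS + HiFiGAN, Grad-TTS + HiFiGAN, and VITS, NaturalSpeech obtains a higher MOS (4.56 vs 4.32 to 4.43), while the CMOS values of those baselines range from -0.20 to -0.33.
The ablation study shows that each key component (phoneme pre-training, differentiable durator, bidirectional prior/posterior, memory in the VAE) contributes to CMOS; removing any one of them changes CMOS by between -0.06 and -0.12.
NaturalSpeech offers faster or comparable inference speed (RTF of about 0.013) while delivering better voice quality.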
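
The significance claims above rest on a Wilcoxon signed-rank test over comparative scores. The sketch below is a minimal, standard-library implementation of the two-sided test using the normal approximation, included only to show what "p >> 0.05" means operationally. The function name `wilcoxon_signed_rank` and both rating lists are hypothetical illustrations, not code or data from the paper, and for small samples an exact test would be more appropriate than this approximation.

```python
import math
from collections import Counter


def wilcoxon_signed_rank(scores):
    """Two-sided Wilcoxon signed-rank test against a median of zero.

    `scores` are comparative ratings (e.g. CMOS votes on a -3..3 scale).
    Zero scores are dropped and tied absolute values share an average rank.
    Uses the normal approximation with a tie-corrected variance.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [s for s in scores if s != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank absolute values (1-based), averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis, W has this mean and (tie-corrected) variance.
    mean = n * (n + 1) / 4
    tie_sizes = Counter(abs(d) for d in diffs).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w_plus, w_minus, p_value


# Hypothetical ratings: one balanced set, one consistently negative set.
balanced = [1, -1, 0, 2, -2, 0, 1, -1, 0, 0]
skewed = [-1, -2, -1, -3, -1, -2, -1, -2, -1, -1, -2, -1]

for name, votes in [("balanced", balanced), ("skewed", skewed)]:
    w_plus, w_minus, p = wilcoxon_signed_rank(votes)
    verdict = "no significant difference" if p > 0.05 else "significant difference"
    print(f"{name}: W+={w_plus}, W-={w_minus}, p={p:.4f} -> {verdict}")
```

A mean comparative score near zero together with a p-value well above 0.05 is the pattern reported for NaturalSpeech against human recordings; the skewed set illustrates the opposite case, where listeners consistently prefer one side.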

This review was generated by AI and checked by a human editor.