QUICK REVIEW

[Paper Review] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Kai Shen, Zeqian Ju|arXiv (Cornell University)|Apr 18, 2023

Speech Recognition and Synthesis|Cited by 37

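One-Sentence Summary

Summary: NaturalSpeech 2 uses a neural audio codec with continuous latent vectors and a text-conditioned latent diffusion model to achieve zero-shot, multi-speaker speech and singing synthesis with high naturalness and robustness.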

ABSTRACT

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.

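Motivation and Goals

Address the need for diverse, high-quality text-to-speech beyond a single speaker by scaling to large multi-speaker datasets and in-the-wild audio.
Remove the token-based autoregressive bottleneck by using continuous latent representations and diffusion.
Achieve strong zero-shot capability through speech prompting and duration/pitch conditioning.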

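Proposed Method

Train a neural audio codec that converts speech into continuous latent vectors (based on RVQ) for high-fidelity reconstruction.
Use a non-autoregressive latent diffusion model to generate the latent vectors, conditioned on text through a prior (a phoneme encoder and a duration/pitch predictor).
Introduce a speech prompting mechanism that enables in-context learning in the diffusion model and the duration/pitch predictor for zero-shot synthesis.
Formulate diffusion with a forward SDE and a reverse SDE/ODE, and optimize a combined loss with diffusion, duration, and pitch terms plus an RVQ-based cross-entropy regularization term.
Train on the 44K-hour MLS English dataset, which contains 2,742 male and 2,748 female speakers, and evaluate zero-shot synthesis on LibriSpeech test-clean and VCTK.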
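The RVQ step in the codec can be illustrated with a minimal sketch. This is not the paper's codec: the two tiny codebooks below are hand-picked hypothetical values, and the helper names `nearest` and `rvq_encode` are introduced only for illustration. The sketch shows the general idea of residual vector quantization, where each stage quantizes what the previous stage left over, and the selected codebook vectors are summed into one continuous-valued latent vector.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector,
               codebooks: Sequence[Sequence[Vector]]) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage quantizes the previous residual."""
    residual = list(x)
    quantized = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # Accumulate the chosen entry and subtract it from the residual.
        quantized = [q + c for q, c in zip(quantized, codebook[i])]
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, quantized


# Toy codebooks (hypothetical values): two stages over 2-D frames.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]

frame = [1.2, 0.9]  # one frame-level representation (made up)
indices, z = rvq_encode(frame, CODEBOOKS)
print(indices, z)  # prints: [1, 1] [1.25, 1.0]
```

The per-stage indices are the discrete tokens that token-based systems would model one by one, while the summed vector `z` is a single continuous target per frame, which is the kind of quantity a latent diffusion model can generate directly.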

Figure 1: The overview of NaturalSpeech 2, with an audio codec encoder/decoder and a latent diffusion model conditioned on a prior (a phoneme encoder and a duration/pitch predictor). The details of in-context learning in the duration/pitch predictor and diffusion model are shown in Figure 3.
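To make the forward SDE in the method summary more concrete, here is a minimal sketch of how a clean latent vector could be noised. It assumes the standard variance-preserving form dz_t = -1/2 β_t z_t dt + sqrt(β_t) dw_t with a linear β schedule. The schedule endpoints `BETA_MIN` and `BETA_MAX` and the function names are illustrative assumptions, not values taken from this review.

```python
import math
import random
from typing import List, Tuple

# Assumed linear noise schedule endpoints (illustrative, not from the review).
BETA_MIN, BETA_MAX = 0.1, 20.0


def beta_integral(t: float) -> float:
    """Integral of beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN) over [0, t]."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def marginal(t: float) -> Tuple[float, float]:
    """Mean scale and standard deviation of z_t given z_0 under the VP-SDE."""
    b = beta_integral(t)
    return math.exp(-0.5 * b), math.sqrt(1.0 - math.exp(-b))


def noise_latent(z0: List[float], t: float, rng: random.Random) -> List[float]:
    """Sample z_t ~ N(scale * z_0, std^2 * I) in closed form, t in [0, 1]."""
    scale, std = marginal(t)
    return [scale * v + std * rng.gauss(0.0, 1.0) for v in z0]


rng = random.Random(0)            # fixed seed for reproducibility
z0 = [1.25, 1.0]                  # a made-up clean latent frame
for t in (0.0, 0.5, 1.0):
    print(t, marginal(t), noise_latent(z0, t, rng))
```

At t = 0 the latent is untouched, and by t = 1 it is close to pure Gaussian noise. A denoising network conditioned on the prior and the speech prompt would then be trained to undo this corruption, and the reverse SDE/ODE would be integrated from noise back to a latent at synthesis time.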

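Experimental Results

Research Questions

RQ1: Can latent diffusion over continuous neural-codec latent vectors achieve naturalness and robustness in zero-shot, multi-speaker text-to-speech?
RQ2: Does speech prompting improve in-context learning of zero-shot speaker identity and style (including singing) without an explicit singing prompt?
RQ3: In the zero-shot setting, how does NaturalSpeech 2 compare with autoregressive/discrete-token baselines in prosody fidelity and robustness?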

Key Findings

Setting	LibriSpeech CMOS	VCTK CMOS
Ground Truth	+0.04	-0.30
YourTTS	-0.65	-0.58
NaturalSpeech 2	0.00	0.00

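NaturalSpeech 2 achieves high naturalness: in the CMOS tests it is on par with ground-truth recordings on LibriSpeech and is rated above them on VCTK.
It consistently outperforms the YourTTS baseline in prosody similarity to the prompt and to the ground truth, and shows stronger speaker similarity (SMOS).
Zero-shot singing synthesis works with only a speech prompt, producing singing in a novel timbre without an explicit singing prompt.
The model performs strongly in the zero-shot setting on unseen speakers (LibriSpeech test-clean and VCTK), and its diffusion-based generation improves robustness relative to autoregressive methods.
In the CMOS tests, NaturalSpeech 2 is the 0.00 reference: Ground Truth scores +0.04 on LibriSpeech and -0.30 on VCTK, while YourTTS scores -0.65 and -0.58 respectively.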
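For readers unfamiliar with CMOS, the following sketch shows how such a score is typically computed. It assumes the common 7-point comparison scale from -3 (much worse than the reference) to +3 (much better). The rater judgments are hypothetical and are not data from the paper, and the function name `cmos` is introduced only for illustration.

```python
from typing import Dict, List


def cmos(ratings: List[int]) -> float:
    """Mean of comparative ratings on an assumed -3..+3 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not -3 <= r <= 3:
            raise ValueError(f"rating {r} is outside the -3..+3 scale")
    return sum(ratings) / len(ratings)


# Hypothetical judgments: each number compares one utterance from the
# listed system against the same utterance from the reference system.
judgments: Dict[str, List[int]] = {
    "system A": [1, 0, 0, -1],
    "system B": [-1, -1, 0, -2],
}

for name, ratings in judgments.items():
    print(f"{name}: {cmos(ratings):+.2f}")
```

Because every score is relative to the reference, the reference system itself always sits at 0.00. A negative value means listeners preferred the reference, and a positive value means they preferred the compared system.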

Figure 2: The neural audio codec consists of an encoder, a residual vector-quantizer (RVQ), and a decoder. The encoder extracts the frame-level speech representations from the audio waveform, the RVQ leverages multiple codebooks to quantize the frame-level representations, and the decoder takes the


This review was generated by AI and reviewed by a human editor.