QUICK REVIEW

[Paper Review] VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Sanyuan Chen, Shujie Liu|arXiv (Cornell University)|Jun 8, 2024

Speech Recognition and Synthesis|Cited by 8

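One-Sentence Summary

VALL-E 2 introduces Repetition Aware Sampling and Grouped Code Modeling, enabling neural codec language models to reach human parity in zero-shot TTS on LibriSpeech and VCTK.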

ABSTRACT

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. See https://aka.ms/valle2 for demos of VALL-E 2.

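Research Motivation and Objectives

Advance zero-shot TTS toward human-parity voice cloning without training data from the target speaker.
Propose novel sampling and grouping strategies that enable stable, efficient decoding and long-sequence modeling.
Show that the codec language modeling approach can match human performance on benchmark datasets.
Demonstrate robustness on challenging sentences and repetitive phrases.
Highlight the simplicity of the training data requirements, along with potential applications and risks.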

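Proposed Method

Introduce Repetition Aware Sampling, which switches between random sampling and nucleus sampling according to token repetition in the decoding history.
Propose Grouped Code Modeling, which partitions codec codes into groups and models each group as a single frame to shorten the sequence.
Adopt a hybrid autoregressive (AR) and non-autoregressive (NAR) Transformer architecture for codec code generation.
Train on utterance-level speech-transcription pairs from Libriheavy, using Encodec for tokenization and Vocos for decoding.
Formulate grouped-code likelihood objectives for the AR and NAR components that maximize the conditional log-likelihood given the text and prompt codes.
Perform zero-shot TTS by prompting the model with a speech prompt from an unseen speaker to generate target codes and synthesize speech.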
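
The sampling step described in this list can be sketched in a few lines of Python. This is a minimal illustration of one plausible reading, not the authors' implementation: a token is first proposed with nucleus sampling, and if it already fills too much of the recent decoding history, it is replaced by a draw from the full distribution. The function names, the `top_p`, `window`, and `threshold` defaults, and the choice to divide by the length of the recent window are all assumptions made for this example.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of top-ranked tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    top_p: float = 0.8,      # illustrative value, not taken from the paper
    window: int = 10,        # illustrative value, not taken from the paper
    threshold: float = 0.1,  # illustrative value, not taken from the paper
) -> int:
    """Propose a token by nucleus sampling; resample randomly if it repeats too often."""
    token = nucleus_sample(probs, top_p, rng)

    # Fraction of the most recent `window` tokens equal to the proposed token.
    recent = list(history[-window:])
    ratio = recent.count(token) / len(recent) if recent else 0.0

    if ratio > threshold:
        # Assumed fallback: draw from the full, untruncated distribution.
        token = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return token


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]
    stuck_history = [0] * 10  # the decoder has emitted token 0 ten times in a row
    print(repetition_aware_sample(probs, stuck_history, rng, top_p=0.5))
```

With `top_p=0.5` the nucleus here contains only token 0, so plain nucleus sampling could never leave the loop. The fallback to the full distribution is what gives the other tokens a chance once the history is saturated.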

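Experimental Results

Research Questions

RQ1: Can VALL-E 2 achieve human parity in zero-shot TTS on standard benchmarks?
RQ2: Do Repetition Aware Sampling and Grouped Code Modeling improve the stability, speed, and long-sequence modeling of codec-based TTS?
RQ3: Are simple utterance-level speech-transcription pairs sufficient to train a high-quality zero-shot TTS model?
RQ4: Does the system maintain speaker similarity, naturalness, and robustness on both in-domain and out-of-domain datasets?
RQ5: Can the model handle challenging or repetitive sentences with consistently high synthesis quality?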

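Key Findings

VALL-E 2 reaches human parity on LibriSpeech and VCTK in robustness, naturalness, and speaker similarity.
Grouped Code Modeling shortens the sequence length and speeds up inference while alleviating long-context issues.
Repetition Aware Sampling stabilizes decoding and avoids infinite loops without adding latency.
Trained only on utterance-level speech-transcription pairs, the model achieves strong zero-shot TTS performance.
VALL-E 2 synthesizes complex sentences and repetitive phrases robustly.
Decoding can be accelerated substantially with almost no loss in performance.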
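
The sequence-shortening effect of Grouped Code Modeling can be illustrated with a small sketch. It only shows the bookkeeping: consecutive codec codes are packed into fixed-size groups so that the autoregressive model takes one step per group instead of one step per code. The helper name `group_codes`, the `pad_id` argument, and the padding of a trailing partial group are assumptions for this example. The summary does not say how the paper handles partial groups.

```python
from typing import List, Sequence


def group_codes(codes: Sequence[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack a flat code sequence into consecutive groups of `group_size` codes."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    padded = list(codes)
    remainder = len(padded) % group_size
    if remainder:
        # Assumption: pad the final partial group so every group has equal length.
        padded.extend([pad_id] * (group_size - remainder))
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


if __name__ == "__main__":
    codes = list(range(750))  # e.g. 10 s of audio at an assumed 75 codes per second
    for g in (1, 2, 4):
        # Number of autoregressive steps needed for each group size.
        print(g, len(group_codes(codes, g)))
```

With a group size of 2, the 750-code sequence becomes 375 steps, which is the source of both the shorter context and the faster decoding.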

This review was generated by AI and reviewed by human editors.