QUICK REVIEW

[Paper Review] BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Mateusz Łajszczak, Guillermo Cámbara|arXiv (Cornell University)|Feb 12, 2024

Natural Language Processing Techniques | Cited by 23

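One-Sentence Summary

BASE TTS is a 1B-parameter autoregressive TTS model, trained on 100K hours of public domain speech data, that uses discrete speechcodes and a fast, streamable speechcode decoder to achieve state-of-the-art naturalness and emergent abilities in TTS.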

ABSTRACT

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

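Motivation and Goals

Show that scaling up data and parameter count yields emergent TTS abilities analogous to those reported for large language models.
Introduce a WavLM-based discrete speech representation (speechcodes) with speaker disentanglement.
Demonstrate how a speechcode autoregressive model combined with a streamable decoder achieves high naturalness and faster synthesis.
Provide an emergent abilities test set for evaluating TTS on challenging text.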

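Proposed Method

Frame text-to-speech as next-token prediction over a sequence of text tokens followed by discrete speech representations (speechcodes).
Compare two speech tokenizers: a VQ-VAE, and WavLM-based speechcodes with speaker disentanglement and BPE compression.
Train a GPT-2-style autoregressive model (SpeechGPT) that predicts speechcodes conditioned on text and a reference speaker.
Develop a speechcode decoder that generates waveforms directly, end to end, replacing diffusion-based decoding to make synthesis streamable and fast.
Discretize the speech representation at 50 Hz and apply BPE to shorten sequences, enabling longer-context modeling.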
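To make the last step concrete, here is a minimal Python sketch of why BPE helps: at 50 Hz every second of audio costs 50 codes, and merging frequent adjacent code pairs shortens the sequence the autoregressive model has to handle. This is a toy illustration only, not the paper's tokenizer. The function names (`num_frames`, `toy_bpe`) and the choice of new token IDs are assumptions made for this example.

```python
from collections import Counter

FRAME_RATE_HZ = 50  # frame rate stated in the summary above


def num_frames(duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Number of speechcodes for an utterance before any BPE compression."""
    return int(duration_s * frame_rate_hz)


def most_frequent_pair(seq):
    """Return the most common adjacent token pair, or None if seq is too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def toy_bpe(seq, num_merges, first_new_token):
    """Greedy BPE on a single sequence (hypothetical helper, for illustration).

    New token IDs start at `first_new_token`, which must not collide
    with any existing code in `seq`.
    """
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_token = first_new_token + step
        seq = merge_pair(seq, pair, new_token)
        merges.append((pair, new_token))
    return seq, merges


if __name__ == "__main__":
    print(num_frames(10.0))                      # codes for a 10-second clip
    print(toy_bpe([1, 2, 1, 2, 3, 1, 2], 1, 100))
```

A 10-second utterance yields 500 codes before compression. In the toy sequence, the pair `(1, 2)` occurs three times, so a single merge reduces seven codes to four. A real tokenizer would learn its merges over a whole corpus rather than one sequence.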

Figure 1: An overview of BASE TTS. The speech tokenizer (1) learns a discrete representation, which is modeled by an autoregressive model (2) conditioned on text and reference speech. The speechcode decoder (3) converts predicted speech representations into a waveform.
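The hand-off between stages (2) and (3) can be sketched as a Python generator that decodes speechcodes chunk by chunk, so audio can start playing before the autoregressive model has finished. This is a simplified illustration under stated assumptions, not the paper's decoder. `stream_waveform`, `toy_decoder`, the chunk size, and the samples-per-code ratio are all hypothetical, and the toy decoder carries no context across chunk boundaries, which a real convolutional decoder would need to do.

```python
from typing import Callable, Iterable, Iterator, List


def stream_waveform(
    speechcodes: Iterable[int],
    decode_chunk: Callable[[List[int]], List[float]],
    chunk_size: int = 25,
) -> Iterator[List[float]]:
    """Yield waveform chunks as soon as `chunk_size` codes are available."""
    buffer: List[int] = []
    for code in speechcodes:
        buffer.append(code)
        if len(buffer) == chunk_size:
            yield decode_chunk(buffer)
            buffer = []
    if buffer:  # flush the final partial chunk
        yield decode_chunk(buffer)


def toy_decoder(codes: List[int], samples_per_code: int = 4) -> List[float]:
    """Placeholder decoder (assumption): emits a few fake samples per code."""
    return [code / 1000.0 for code in codes for _ in range(samples_per_code)]


if __name__ == "__main__":
    # `range(60)` stands in for codes arriving from the autoregressive model.
    for i, chunk in enumerate(stream_waveform(range(60), toy_decoder)):
        print(f"chunk {i}: {len(chunk)} samples")
```

Because `speechcodes` can be any iterable, including a generator that produces codes one at a time, the first waveform chunk is available after only `chunk_size` codes rather than after the full utterance.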

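Experimental Results

Research Questions

RQ1: Does a large-scale TTS model trained on 100K hours exhibit emergent prosodic and linguistic abilities on challenging text?
RQ2: Which discrete speech representation (VQ-VAE vs. WavLM-based) better captures phonemic and prosodic information while disentangling speaker attributes?
RQ3: Can a fast, streamable speechcode decoder, compared with diffusion-based decoding, substantially cut synthesis time while maintaining or improving speech quality?
RQ4: How do model and data scale affect subjective naturalness, intelligibility, and speaker similarity across languages and speakers?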

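Key Findings

BASE TTS achieves state-of-the-art naturalness against publicly available large-scale TTS (LTTS) baselines (YourTTS, Bark, TortoiseTTS).
In MUSHRA tests, WavLM-based speechcodes match or outperform VQ-VAE speechcodes: significantly better for the Spanish voice and on par for English.
The speechcode decoder runs inference 3x faster than the diffusion-based decoder with no loss in quality, making end-to-end waveform generation practical.
Emergent abilities appear with scale: BASE-medium (10K hours, 400M parameters) improves substantially in several categories; BASE-large (100K hours, 1B parameters) brings further gains, although some categories saturate.
The paper proposes an emergent abilities test set spanning seven categories (compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, syntactic complexity), with outputs evaluated by linguistic experts.
The model achieves high naturalness and robust performance across multilingual, multi-speaker conditions, with shorter synthesis time and better prosody on complex text.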

Figure 2: WavLM-based speech tokenizer. The proposed architecture encourages disentanglement of speaker and content information.


This review was generated by AI and reviewed by human editors.