QUICK REVIEW

[Paper Review] BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Mateusz Łajszczak, Guillermo Cámbara|arXiv (Cornell University)|Feb 12, 2024

Natural Language Processing Techniques | Cited by 23

One-line summary

BASE TTS trains a 1B-parameter autoregressive TTS model on 100K hours of public-domain data and, using discrete speechcodes with a fast, streamable speechcode decoder, achieves state-of-the-art naturalness and emergent abilities in TTS.

ABSTRACT

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

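Research motivation and objectives

Demonstrate that scaling data and parameters yields emergent TTS abilities comparable to those seen in large language models.
Introduce discrete speech representations (speechcodes) built on WavLM with speaker disentanglement.
Show that a speechcode autoregressive model plus a streamable decoder can deliver high naturalness and faster synthesis.
Provide an emergent-abilities test set that evaluates TTS on challenging text.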

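Proposed method

Model speech synthesis as next-token prediction over text and discrete speech representations (speechcodes).
Compare VQ-VAE-based and WavLM-based speechcodes (with speaker disentanglement and BPE compression).
Train a GPT-2-style autoregressive model (SpeechGPT) that predicts speechcodes conditioned on text and a reference speaker.
Develop a speechcode decoder that replaces diffusion-based decoding and generates waveforms directly end to end, ensuring streamability and speed.
Discretize the speech representation at 50 Hz and shorten sequences with BPE, enabling modeling of longer contexts.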
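The BPE step in the last item can be illustrated with a toy example. The sketch below is a generic byte-pair-encoding loop over integer codes, not the paper's tokenizer: the function names, the code values, and the choice of 1000 as the first merged-token id are all assumptions made for illustration. It shows only the general idea that repeatedly merging the most frequent adjacent pair shortens a sequence while keeping it losslessly recoverable.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return ((a, b), count) for the most frequent adjacent pair, or None."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0]


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Learn up to `num_merges` merges; return (merges, compressed sequences)."""
    merges = []
    seqs = [list(s) for s in seqs]
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:
            break  # nothing repeats, so merging would not help
        pair = best[0]
        new_id = first_new_id + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


def expand(seq, merges):
    """Undo all merges, recovering the original code sequence."""
    table = {new_id: pair for pair, new_id in merges}
    out = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            a, b = table[tok]
            stack.append(b)  # push right half first so the left half pops first
            stack.append(a)
        else:
            out.append(tok)
    return out


FRAME_RATE_HZ = 50  # frame rate stated in the review; code values below are made up
codes = [7, 7, 7, 7, 3, 3, 7, 7, 7, 7, 3, 3, 5, 5]
merges, compressed_seqs = learn_bpe([codes], num_merges=3, first_new_id=1000)
compressed = compressed_seqs[0]

print(f"{len(codes)} frames = {len(codes) / FRAME_RATE_HZ:.2f} s of audio")
print(f"tokens before BPE: {len(codes)}, after BPE: {len(compressed)}")
```

Because the same audio span is covered by fewer tokens after merging, a model with a fixed context window can attend over a longer stretch of speech, which is the motivation the list gives for BPE.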

Figure 1: An overview of BASE TTS. The speech tokenizer (1) learns a discrete representation, which is modeled by an autoregressive model (2) conditioned on text and reference speech. The speechcode decoder (3) converts predicted speech representations into a waveform.
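The autoregressive stage (2) in the caption corresponds to the standard chain-rule factorization of a sequence model. As a sketch, with notation assumed here rather than taken from the paper ($x$ for the input text, $r$ for the reference speech, and $s_1, \dots, s_T$ for the speechcodes):

```latex
p(s_1, \dots, s_T \mid x, r) \;=\; \prod_{t=1}^{T} p\left(s_t \mid s_{<t},\, x,\, r\right)
```

Training by next-token prediction then amounts to minimizing the negative logarithm of this product over the training data.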

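Experimental results

Research questions

RQ1: Does a large-scale TTS model trained on 100K hours show emergent prosodic and linguistic abilities on challenging text?
RQ2: Which discrete speech representation (VQ-VAE vs. WavLM-based) better captures phonetic and prosodic information while disentangling speaker attributes?
RQ3: Does the fast, streamable speechcode decoder, compared with a diffusion-based decoder, markedly reduce synthesis time while maintaining or improving audio quality?
RQ4: How do model and data scale affect perceived naturalness, intelligibility, and speaker similarity across languages and speakers?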

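Key findings

BASE TTS achieves state-of-the-art naturalness against publicly available large-scale TTS (LTTS) baselines: YourTTS, Bark, and TortoiseTTS.
In MUSHRA tests, WavLM-based speechcodes perform on par with or better than VQ-VAE speechcodes, with a marked improvement for Spanish speakers and parity in English.
The speechcode decoder offers 3x faster inference than the diffusion-based decoder, making end-to-end waveform generation practical without degrading quality.
Emergent abilities appear with scale: BASE-medium (10K hours, 400M params) shows large improvements across categories, and BASE-large (100K hours, 1B params) improves further, although some categories saturate.
The authors propose an emergent-abilities test set covering seven categories (compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, and syntactic complexity), evaluated by linguistic experts.
The model achieves high naturalness and robust performance under multilingual, multi-speaker conditions, with shorter synthesis time and improved prosody on complex text.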
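To make the notion of a streamable decoder concrete, the sketch below shows the control flow only: speechcodes arrive one at a time and audio is emitted in chunks before the full sequence is known. Everything here is hypothetical. `toy_decode` is a placeholder that emits a constant-valued frame per code, not a neural decoder, and the 24 kHz sample rate is an assumption chosen so that each 50 Hz code maps to a whole number of samples. A real convolutional decoder would also need to carry context across chunk boundaries, which this toy ignores.

```python
from typing import Iterable, Iterator, List

CODE_RATE_HZ = 50         # frame rate stated in the review
SAMPLE_RATE_HZ = 24_000   # assumption for illustration only
SAMPLES_PER_CODE = SAMPLE_RATE_HZ // CODE_RATE_HZ  # 480 samples per code


def toy_decode(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for a decoder: one constant-valued frame per code."""
    samples: List[float] = []
    for code in chunk:
        samples.extend([code / 1000.0] * SAMPLES_PER_CODE)
    return samples


def stream_decode(codes: Iterable[int], chunk_size: int) -> Iterator[List[float]]:
    """Yield audio as soon as `chunk_size` codes are buffered, then flush the rest."""
    buffer: List[int] = []
    for code in codes:
        buffer.append(code)
        if len(buffer) == chunk_size:
            yield toy_decode(buffer)
            buffer = []
    if buffer:
        yield toy_decode(buffer)  # final partial chunk


# 120 codes (2.4 s at 50 Hz), decoded in chunks of 50 codes (1 s of audio each).
chunks = list(stream_decode(iter(range(120)), chunk_size=50))
chunk_lengths = [len(c) for c in chunks]
print(chunk_lengths)
```

With this structure the first second of audio is available after 50 codes rather than after all 120, which is the latency benefit that streaming is meant to provide.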

Figure 2: WavLM-based speech tokenizer. The proposed architecture encourages disentanglement of speaker and content information.


This review was generated by AI and checked by a human editor.