QUICK REVIEW

[Paper Review] StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han|PubMed|Jun 13, 2023

Speech Recognition and Synthesis | References: 61 | Citations: 23

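One-Line Summary

StyleTTS 2 introduces style diffusion and adversarial training with large speech language models to achieve human-level TTS. It surpasses human recordings on single-speaker LJSpeech, matches them on multispeaker VCTK, and delivers strong zero-shot speaker adaptation on LibriTTS.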

ABSTRACT

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

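Motivation and Objectives

Achieve human-level TTS by modeling style as a latent variable with a diffusion model, so that no reference speech is needed at generation time.
Use large pre-trained speech language models as discriminators to improve naturalness through adversarial training.
Enable end-to-end training with differentiable duration modeling, improving stability and synthesis quality.
Demonstrate strong performance on single-speaker and multispeaker datasets, and data-efficient zero-shot speaker adaptation on LibriTTS.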

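Proposed Method

Models speech style as a latent variable sampled by a diffusion model conditioned on the input text.
Trains end to end, generating the waveform directly from text, style, and prosody without a fixed vocoder.
Replaces mel-spectrogram generation with a waveform decoder (HiFi-GAN or iSTFTNet) and applies AdaIN for style conditioning.
Uses a large SLM (e.g., WavLM) as a discriminator and combines it with differentiable duration modeling to enable SLM-based adversarial training (L_slm).
Adopts differentiable duration modeling, which maps predicted phoneme durations to frame-level upsampling in a differentiable way (Gaussian upsampling and a non-parametric method).
Handles the multispeaker setting by conditioning the diffusion on a speaker reference embedding and using adaptive styling for speaker adaptation.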
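
Two of the building blocks named in this list, AdaIN style conditioning and Gaussian duration upsampling, can be illustrated in a few lines. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the `gamma`/`beta` arguments (standing in for the scale and shift that a learned projection of the style vector would produce), and the fixed `sigma` are all assumptions made for illustration. It only shows the forward computation; because every step is a smooth function of the durations, the same arithmetic written in an autodiff framework would let gradients reach the duration predictor.

```python
import math


def adain(channel, gamma, beta, eps=1e-5):
    """Adaptive instance normalization over one feature channel.

    gamma and beta are hypothetical stand-ins for the style-dependent
    scale and shift; in a real model they would come from a learned
    projection of the style vector.
    """
    mu = sum(channel) / len(channel)
    var = sum((x - mu) ** 2 for x in channel) / len(channel)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in channel]


def gaussian_upsample(durations, features, sigma=1.0):
    """Softly align phoneme-level features to frames.

    durations: positive floats, predicted length of each phoneme in frames.
    features:  one equal-length list of floats per phoneme.
    Returns one feature vector per output frame.
    """
    # Centre of each phoneme on the frame axis: its end minus half its length.
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    # The frame count itself is rounded (not differentiable); the frame
    # values below are smooth functions of the durations.
    n_frames = max(1, round(end))
    dim = len(features[0])
    frames = []
    for t in range(n_frames):
        # Gaussian log-weight of every phoneme at the midpoint of frame t.
        logits = [-((t + 0.5 - c) ** 2) / (2.0 * sigma ** 2) for c in centers]
        top = max(logits)  # subtract the maximum for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        frames.append([
            sum(w / total * f[k] for w, f in zip(weights, features))
            for k in range(dim)
        ])
    return frames
```

Calling `gaussian_upsample([2.0, 3.0, 1.0], feats)` with three phoneme vectors in `feats` yields six frames, each a weighted blend dominated by the phoneme whose centre is nearest. A hard repeat-by-duration upsampler would instead copy each vector an integer number of times, which gives no gradient with respect to the durations.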

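Experimental Results

Research Questions

RQ1: Can style diffusion deliver diverse, high-quality TTS without reference speech while remaining efficient?
RQ2: Does using large pre-trained SLMs as discriminators improve the naturalness and robustness of adversarially trained TTS?
RQ3: Does end-to-end training with differentiable duration modeling yield human-level naturalness and speaker similarity on standard datasets?
RQ4: How does StyleTTS 2's performance differ between single-speaker and multispeaker settings, and in zero-shot speaker adaptation?
RQ5: Is StyleTTS 2 robust to out-of-distribution text and data-efficient with limited training data?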

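Key Findings

On LJSpeech, StyleTTS 2 achieves a CMOS of +0.28 against ground truth and +1.07 against NaturalSpeech (p<0.05 and p<<0.01, respectively).
On multispeaker VCTK, it scores a naturalness CMOS of −0.02 and a similarity CMOS of +0.30 relative to the reference recordings (p>0.05 and p<0.1, respectively).
StyleTTS 2 reaches a MOS of 3.83 on LJSpeech, ahead of prior models, and on VCTK its CMOS is close to ground truth, indicating near-human naturalness.
In zero-shot adaptation on LibriTTS, StyleTTS 2 outperforms Vall-E in naturalness with a CMOS of +0.67 (p<0.01) while using roughly 250 times less training data (245 hours versus 60k hours).
StyleTTS 2 is robust to out-of-distribution (OOD) text: it beats the baselines on MOS-N for OOD text, with minimal loss of naturalness on unseen content.
The approach enables end-to-end differentiable training with style diffusion and an SLM-based adversarial loss, achieving human-level TTS on public single-speaker and multispeaker datasets.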
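
To make the reported numbers easier to read: a CMOS is the mean of listeners' comparative ratings between two systems, so a positive value means the first system was preferred on average. The sketch below uses made-up ratings purely for illustration (they are not data from the paper), and the one-sample t statistic is an assumed stand-in, since this review does not say which significance test the authors used. The last line checks the data-efficiency figure from the 245-hour versus 60k-hour comparison.

```python
import math
from statistics import mean, stdev


def cmos(ratings):
    """Mean comparative rating; positive means the system beat the reference."""
    return mean(ratings)


def t_statistic(ratings):
    """One-sample t statistic against a null of zero mean preference.

    Assumed stand-in for a significance test; not taken from the paper.
    """
    n = len(ratings)
    return mean(ratings) / (stdev(ratings) / math.sqrt(n))


# Hypothetical listener ratings, invented for this example only.
hypothetical_ratings = [1, 0, 2, -1, 1, 0, 1, 1, -1, 2]

score = cmos(hypothetical_ratings)
t_value = t_statistic(hypothetical_ratings)

# Training-data ratio behind the "roughly 250 times less data" claim.
data_ratio = 60_000 / 245
```

With the hour counts quoted above, `data_ratio` comes out a little under 245, which is consistent with the rounded "about 250x" figure.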

This review was generated by AI and checked by a human editor.