QUICK REVIEW

[Paper Review] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Kai Shen, Zeqian Ju|arXiv (Cornell University)|Apr 18, 2023

Speech Recognition and Synthesis|Cited by 37

One-line summary

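NaturalSpeech 2 combines a neural audio codec that uses continuous latent vectors with a text-conditioned latent diffusion model, achieving zero-shot, multi-speaker, and singing synthesis with high naturalness and robustness.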

ABSTRACT

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.

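Motivation and objectives

Scale TTS to large multi-speaker and in-the-wild datasets to meet the demand for diverse, high-quality synthesis beyond a single speaker.
Remove the bottleneck of token-based autoregressive generation by using continuous latent representations and diffusion.
Achieve strong zero-shot capability through speech prompting and duration/pitch conditioning.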

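Proposed method

Train an RVQ-based neural audio codec that converts speech into continuous latent vectors and reconstructs it with high fidelity.
Use a non-autoregressive latent diffusion model that generates these latent vectors conditioned on text through a prior consisting of a phoneme encoder and a duration/pitch predictor.
Incorporate a speech prompting mechanism that enables in-context learning for zero-shot synthesis in both the diffusion model and the duration/pitch predictor.
Formulate diffusion with a forward SDE and a reverse SDE/ODE, and optimize a loss that combines the diffusion, duration, and pitch terms with an RVQ cross-entropy regularization term.
Train on 44K hours of MLS English with 2,742 male and 2,748 female speakers, and evaluate zero-shot synthesis on LibriSpeech test-clean and VCTK.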
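The forward SDE named in the list above can be illustrated with a small sketch. It assumes a variance-preserving SDE, dz = -½ β_t z dt + √β_t dw, with a linear noise schedule; the review does not state the schedule, so `BETA_0`, `BETA_1`, and all function names here are illustrative assumptions, and the toy vector only stands in for one codec latent frame.

```python
import math
import random

# Assumed linear noise schedule beta_t = BETA_0 + (BETA_1 - BETA_0) * t.
# These values are illustrative and are not taken from the review.
BETA_0, BETA_1 = 0.1, 20.0

def integrated_beta(t):
    """Integral of beta_s over [0, t] for the assumed linear schedule."""
    return BETA_0 * t + 0.5 * (BETA_1 - BETA_0) * t * t

def marginal_params(t):
    """Return (mean_scale, std) of z_t given z_0 under the VP forward SDE."""
    b = integrated_beta(t)
    return math.exp(-0.5 * b), math.sqrt(1.0 - math.exp(-b))

def diffuse(z0, t, rng):
    """Sample z_t = mean_scale * z_0 + std * eps, with eps ~ N(0, I)."""
    mean_scale, std = marginal_params(t)
    return [mean_scale * z + std * rng.gauss(0.0, 1.0) for z in z0]

rng = random.Random(0)
z0 = [0.5, -1.2, 0.3]  # toy stand-in for one continuous codec latent frame
for t in (0.0, 0.5, 1.0):
    scale, std = marginal_params(t)
    print(f"t={t:.1f} mean_scale={scale:.4f} std={std:.4f}")
zt = diffuse(z0, 1.0, rng)  # close to pure Gaussian noise at t = 1
```

Because the marginal is available in closed form, a training step can noise a clean latent at a random time without simulating the SDE step by step; the denoising network and the reverse SDE/ODE sampler are outside the scope of this sketch.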

Figure 1: The overview of NaturalSpeech 2, with an audio codec encoder/decoder and a latent diffusion model conditioned on a prior (a phoneme encoder and a duration/pitch predictor). The details of in-context learning in the duration/pitch predictor and diffusion model are shown in Figure 3.

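Experimental results

Research questions

RQ1: Can latent diffusion over continuous neural codec latents achieve naturalness and robustness in zero-shot multi-speaker TTS?
RQ2: Does speech prompting, without an explicit singing prompt, improve in-context learning of speaker identity and style (including singing) in the zero-shot setting?
RQ3: How does NaturalSpeech 2 compare with autoregressive, discrete-code baselines in prosody fidelity and robustness in the zero-shot setting?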

Key findings

Setting	LibriSpeech CMOS	VCTK CMOS
Ground truth	+0.04	-0.30
YourTTS	-0.65	-0.58
NaturalSpeech 2	0.00	0.00

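NaturalSpeech 2 achieves high naturalness: in the CMOS tests it is on par with the ground-truth recordings on LibriSpeech and competitive on VCTK.
Its prosody similarity is consistently higher than that of YourTTS, measured against both the prompt and the ground-truth speech, and its speaker similarity (SMOS) is also stronger.
Zero-shot singing synthesis is possible with only a speech prompt, producing a new timbre without an explicit singing prompt.
It achieves strong zero-shot performance on unseen speakers (LibriSpeech test-clean and VCTK), with diffusion improving robustness over autoregressive approaches.
In the CMOS tests, where NaturalSpeech 2 is the 0.00 reference, the ground truth scores +0.04 on LibriSpeech and -0.30 on VCTK, while YourTTS scores -0.65 and -0.58, respectively.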
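To make the CMOS figures easier to read, here is a minimal sketch of how a comparative mean opinion score is typically computed: each rater compares a system against a reference and gives a signed score, and the CMOS is the mean of those scores, so the reference itself sits at 0.00. The 7-point scale from -3 to +3 is an assumption (the review does not state the scale), and the ratings below are hypothetical values chosen for illustration, not data from the paper.

```python
from statistics import mean

def cmos(pairwise_scores):
    """Mean of signed comparison scores against the reference system."""
    for s in pairwise_scores:
        if not -3 <= s <= 3:  # assumed 7-point comparison scale
            raise ValueError(f"score {s} is outside the assumed [-3, 3] scale")
    return mean(pairwise_scores)

# Hypothetical rater judgments, NOT taken from the paper.
# Negative means "worse than the reference", positive means "better".
hypothetical_ratings = {
    "reference vs itself": [0, 0, 0, 0],
    "system A vs reference": [-1, 0, -1, -2],
    "system B vs reference": [1, 0, -1, 0],
}

for name, scores in hypothetical_ratings.items():
    print(f"{name}: CMOS = {cmos(scores):+.2f}")
```

Read this way, a negative entry in the table means that row was rated below NaturalSpeech 2 on that dataset, and a positive entry means it was rated above.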

Figure 2: The neural audio codec consists of an encoder, a residual vector-quantizer (RVQ), and a decoder. The encoder extracts the frame-level speech representations from the audio waveform, the RVQ leverages multiple codebooks to quantize the frame-level representations, and the decoder takes the
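The multi-codebook quantization described in the Figure 2 caption can be sketched as follows. This is a toy illustration of residual vector quantization in general, not the paper's codec: the codebook values, the two-dimensional frame, and the function names `nearest` and `rvq_encode` are all hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(frame, codebooks):
    """Quantize one frame; each stage quantizes the residual left by earlier stages."""
    residual = list(frame)
    quantized = [0.0] * len(frame)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Accumulate the chosen entry and subtract it from the residual.
        quantized = [q + c for q, c in zip(quantized, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, quantized

# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
frame = [0.9, 0.3]  # toy stand-in for one frame-level encoder output
indices, quantized = rvq_encode(frame, codebooks)
print(indices, quantized)
```

The sum of the selected entries is the quantized latent vector for the frame, and each additional codebook reduces the remaining reconstruction error. In a trained codec the codebooks are learned jointly with the encoder and decoder rather than fixed by hand.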

This review was generated by AI and checked by a human editor.