QUICK REVIEW

[Paper Review] Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Ziyue Karen Jiang, Jinglin Liu|arXiv (Cornell University)|Jul 14, 2023

Speech Recognition and Synthesis | Cited by 8

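One-line summary

Mega-TTS 2 achieves zero-shot multi-speaker TTS with prompts of arbitrary length by handling speech prompts of varying durations. It introduces a multi-reference timbre encoder, a prosody language model, and an autoregressive duration model, improving speaker identity, prosody, and naturalness for unseen speakers.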

ABSTRACT

Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts, which significantly reduces the data and computation requirements for voice cloning by skipping the fine-tuning process. However, the prompting mechanisms of zero-shot TTS still face challenges in the following aspects: 1) previous works of zero-shot TTS are typically trained with single-sentence prompts, which significantly restricts their performance when the data is relatively sufficient during the inference stage. 2) The prosodic information in prompts is highly coupled with timbre, making it untransferable to each other. This paper introduces Mega-TTS 2, a generic prompting mechanism for zero-shot TTS, to tackle the aforementioned challenges. Specifically, we design a powerful acoustic autoencoder that separately encodes the prosody and timbre information into the compressed latent space while providing high-quality reconstructions. Then, we propose a multi-reference timbre encoder and a prosody latent language model (P-LLM) to extract useful information from multi-sentence prompts. We further leverage the probabilities derived from multiple P-LLM outputs to produce transferable and controllable prosody. Experimental results demonstrate that Mega-TTS 2 could not only synthesize identity-preserving speech with a short prompt of an unseen speaker from arbitrary sources but consistently outperform the fine-tuning method when the volume of data ranges from 10 seconds to 5 minutes. Furthermore, our method enables to transfer various speaking styles to the target timbre in a fine-grained and controlled manner. Audio samples can be found in https://boostprompt.github.io/boostprompt/.

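Motivation and objectives

Improve zero-shot TTS performance when the available prompts are longer and more informative than a single short sentence.
Develop a mechanism that extracts rich timbre and prosody from prompts of arbitrary length.
Introduce in-context learning into duration prediction to improve naturalness.
Provide a way to interpolate prosody from multiple sources, controlling expressiveness while preserving the target timbre.
Evaluate against state-of-the-art zero-shot TTS systems on standard datasets and metrics.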

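Proposed method

Introduce a multi-reference timbre encoder (MRTE) that captures fine-grained timbre from multiple reference utterances.
Train a prosody language model (PLM; called P-LLM in the abstract) that autoregressively generates compressed prosody codes from prompts of arbitrary length.
Propose a phoneme-level autoregressive duration model (ADM) that brings in-context duration information into duration prediction.
Interpolate the prosody outputs of multiple PLM passes to support arbitrary-source prompts, controlling expressiveness while preserving the target timbre.
Adopt a VQ-GAN-based TTS backbone with content, timbre, and prosody encoders plus a discriminator, and use HiFi-GAN for waveform synthesis.
Evaluate with objective metrics (WER, speaker similarity) and subjective MOS tests on LibriSpeech test-clean.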
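
The prosody interpolation step in the list above can be illustrated with a minimal sketch. It assumes, as a simplification not confirmed by this review, that each PLM pass yields a next-token probability vector over a shared prosody codebook and that the two vectors are blended linearly with a weight `gamma`. The function names, the 4-entry codebook, and the toy probabilities are all hypothetical.

```python
import random


def interpolate_distributions(p_target, p_style, gamma):
    """Blend two next-token distributions over the same prosody codebook.

    gamma = 0.0 keeps the distribution conditioned on the target speaker;
    gamma = 1.0 uses the one conditioned on the style prompt.
    The linear rule is an assumption made for illustration.
    """
    if len(p_target) != len(p_style):
        raise ValueError("distributions must cover the same codebook")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    mixed = [(1.0 - gamma) * a + gamma * b for a, b in zip(p_target, p_style)]
    total = sum(mixed)
    # Renormalize to guard against floating-point drift.
    return [m / total for m in mixed]


def sample_code(probs, rng):
    """Draw one codebook index according to the probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy distributions over a hypothetical 4-entry prosody codebook.
p_target = [0.70, 0.20, 0.05, 0.05]
p_style = [0.10, 0.10, 0.40, 0.40]

mixed = interpolate_distributions(p_target, p_style, gamma=0.5)
code = sample_code(mixed, random.Random(0))
```

In this sketch `gamma` is the single knob for expressiveness: values near 0 stay close to the target speaker's own prosody, and values near 1 lean on the style prompt. Timbre is untouched because it would come from a separate encoder.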

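Experimental results

Research questions

RQ1 Can Mega-TTS 2 synthesize high-quality, identity-preserving speech for unseen speakers when given prompts of arbitrary length?
RQ2 Does increasing the length of the speech prompt improve speaker similarity and prosodic naturalness in zero-shot TTS?
RQ3 Do MRTE and ADM contribute noticeably to performance compared with the baselines?
RQ4 Can prosody interpolation from multiple sources control expressiveness without degrading timbre?
RQ5 Can the model handle cross-lingual scenarios while preserving speaker identity?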

Key findings

Model	WER (↓)	SIM (↑)	MOS-Q (↑)	MOS-S (↑)	MOS-P (↑)
Ground truth	2.47%	-	4.35	4.31	4.39
YourTTS	8.64%	0.909	3.81	3.37	3.53
VALL-E	-	-	3.90	3.88	3.98
Mega-TTS	2.96%	0.936	4.08	4.02	4.11
Ours w/ 1 sent + 1s	2.73%	0.920	4.15	3.87	3.96
Ours w/ 10 sent + 5s	2.72%	0.940	4.17	3.97	4.04
Ours w/ 50 sent + 10s	2.73%	0.942	4.15	4.02	4.11
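
For readers unfamiliar with the WER and SIM columns, the sketch below shows how such metrics are commonly computed: WER as word-level edit distance divided by reference length, and SIM as cosine similarity between speaker embeddings. This review does not state which ASR model or speaker encoder the paper used, so the sketch takes transcripts and embedding vectors as given. The function names and example inputs are illustrative only.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Toy 2-dimensional "speaker embeddings"; real ones have hundreds of dimensions.
sim_same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Lower WER means the synthesized speech is transcribed more accurately, and SIM closer to 1 means the synthesized voice sits closer to the prompt speaker in embedding space.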

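As prompt length increases, Mega-TTS 2 improves in speaker similarity and prosodic naturalness.
Combining arbitrary-length prompts with MRTE, PLM, and ADM yields higher SIM than every baseline on LibriSpeech test-clean (0.942 vs. 0.936 for Mega-TTS), with MOS-S and MOS-P that match Mega-TTS (4.02 and 4.11) and exceed YourTTS and VALL-E.
Shorter prompts trail Mega-TTS on MOS-S and MOS-P (3.87 and 3.96 with 1 sentence + 1 s); only the longest prompt (50 sentences + 10 s) closes that gap, which suggests the value of long acoustic prompts.
Ablations show that ADM and MRTE contribute to higher MOS-P and MOS-S, respectively.
Arbitrary-source prompting via multi-source prosody interpolation can control expressiveness while preserving the target timbre.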
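
The prompt-length trend can be checked directly against the table above. The short sketch below copies the three "Ours" rows and computes the gain in each metric from one prompt setting to the next. The variable and function names are illustrative.

```python
# (prompt setting, SIM, MOS-S, MOS-P) copied from the "Ours" rows of the table.
rows = [
    ("1 sent + 1s", 0.920, 3.87, 3.96),
    ("10 sent + 5s", 0.940, 3.97, 4.04),
    ("50 sent + 10s", 0.942, 4.02, 4.11),
]


def deltas(table_rows, column):
    """Difference in one metric between consecutive prompt settings."""
    values = [row[column] for row in table_rows]
    return [round(later - earlier, 3) for earlier, later in zip(values, values[1:])]


sim_deltas = deltas(rows, 1)    # gains in speaker similarity
mos_s_deltas = deltas(rows, 2)  # gains in similarity MOS
mos_p_deltas = deltas(rows, 3)  # gains in prosody MOS

for name, values in [("SIM", sim_deltas), ("MOS-S", mos_s_deltas), ("MOS-P", mos_p_deltas)]:
    print(name, values)
```

Every difference is positive, but the SIM gain shrinks sharply after the second setting (about 0.02, then about 0.002), while the MOS-S and MOS-P gains stay closer in size across the two steps.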


This review was generated by AI and checked by a human editor.