QUICK REVIEW

[Paper Review] ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Rongjie Huang, Zhou Zhao|arXiv (Cornell University)|Jul 13, 2022

Speech Recognition and Synthesis|Cited by 21

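One-line summary

ProDiff uses a generator-based diffusion parameterization that directly predicts clean data and applies knowledge distillation to halve the number of diffusion steps, synthesizing speech in 2 iterations at about 24x faster than real time on a single GPU.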

ABSTRACT

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performances in many generative tasks. However, the inherent iterative sampling process costs hinder their applications to text-to-speech deployment. Through the preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work estimating the gradient for data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target side via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while it maintains sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/

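Motivation and objectives

Evaluate diffusion parameterizations and identify the bottlenecks in sampling speed and quality.
Propose ProDiff, a generator-based diffusion model that uses knowledge distillation to reduce the number of iterations.
Show that ProDiff maintains high fidelity and preserves diversity with far fewer sampling steps.
Compare against state-of-the-art TTS models on standard benchmarks and through ablations.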

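Proposed method

Compare gradient-based and generator-based diffusion parameterizations for TTS denoising.
Introduce generator-based denoising, which avoids score-matching gradient estimation.
Train an N/2-step student by knowledge distillation from an N-step (DDIM) teacher, reducing variance on the target side.
Build ProDiff on the FastSpeech 2 architecture, with a spectrogram denoiser and a dedicated training loss (combining reconstruction, SSIM, and variance terms).
Train on DDIM-based targets, distilling a 4-step teacher into a 2-step student, and add auxiliary losses (SSIM, duration/pitch/energy) to improve quality.
At inference, predict x0 at each step, reconstruct x_{t-1} from the posterior distribution, and synthesize the waveform with a vocoder.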
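
The inference procedure in the last item can be illustrated with a minimal NumPy sketch. It assumes the standard DDPM Gaussian posterior q(x_{t-1} | x_t, x0) and a hypothetical `denoiser(x_t, t)` callable that returns a clean mel-spectrogram estimate; the beta schedule, function names, and array shapes are illustrative choices and are not taken from the paper.

```python
import numpy as np

def make_schedule(betas):
    """Return betas, alphas and cumulative alpha products for a noise schedule."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_step(x_t, x0_hat, t, betas, alphas, alpha_bars, rng):
    """Sample x_{t-1} from q(x_{t-1} | x_t, x0_hat) for the 0-indexed step t."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    # Posterior mean is a weighted mix of the predicted clean data and x_t.
    coef_x0 = np.sqrt(abar_prev) * betas[t] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    if t == 0:
        # Posterior variance is zero at the last step, so no noise is added.
        return mean
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(denoiser, shape, betas, seed=0):
    """Run the reverse process: predict x0 at each step, then resample x_{t-1}."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(betas)
    x = rng.standard_normal(shape)  # start from Gaussian noise
    for t in reversed(range(len(betas))):
        x0_hat = denoiser(x, t)  # generator-based: the model outputs clean data
        x = posterior_step(x, x0_hat, t, betas, alphas, alpha_bars, rng)
    return x

# Hypothetical 2-step schedule and a dummy denoiser, for illustration only.
mel = sample(lambda x, t: np.zeros_like(x), shape=(80, 100), betas=[0.1, 0.9])
```

With only 2 steps the loop calls the denoiser twice, and the final output equals the last x0 prediction because the posterior collapses onto it at t = 0. A vocoder would then convert the resulting mel-spectrogram to a waveform; that stage is outside this sketch.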

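Experimental results

Research questions

RQ1: Can generator-based diffusion speed up sampling in TTS while preserving or improving speech quality compared with gradient-based diffusion?
RQ2: Can knowledge distillation from an N-step teacher to an N/2-step student stabilize training and accelerate inference without sacrificing diversity?
RQ3: How does ProDiff compare with autoregressive and non-autoregressive TTS models on standard benchmarks in terms of quality, speed, and diversity?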

Key findings

Method	MOS	MCD	STOI	PESQ	NDB	JS	RTF
GT	4.41 ± 0.06	/	/	/	/	/	/
GT(voc.)	4.25 ± 0.06	1.08	0.95	3.18	0.23	0.002	/
Tacotron 2	3.90 ± 0.07	5.30	0.18	1.14	0.88	0.022	/
FastSpeech 2	3.92 ± 0.05	4.06	0.23	0.99	0.79	0.021	0.01
GANSpeech	4.00 ± 0.05	4.02	0.21	0.96	0.73	0.104	0.02
Glow-TTS	4.01 ± 0.07	4.35	0.19	1.00	0.74	0.012	0.01
Grad-TTS (64 steps)	4.05 ± 0.06	3.36	0.19	1.48	0.57	0.023	0.19
DiffSpeech (128 steps)	4.09 ± 0.06	3.48	0.83	2.40	0.67	0.008	1.11
ProDiff (2 steps)	4.08 ± 0.07	3.15	0.85	2.55	0.69	0.012	0.04
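
The RTF column can be related to the "24x faster than real time" figure with a short calculation. The sketch below assumes the usual definition of real-time factor (synthesis time divided by audio duration), so the speedup over real time is its reciprocal; the helper names are illustrative.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: seconds of compute needed per second of generated audio."""
    return synthesis_seconds / audio_seconds

def speedup_over_real_time(rtf):
    """How many times faster than real time a given RTF corresponds to."""
    return 1.0 / rtf

# RTF values as reported in the table above (rounded to two decimals).
prodiff_rtf = 0.04
diffspeech_rtf = 1.11

print(round(speedup_over_real_time(prodiff_rtf)))     # 25
print(round(speedup_over_real_time(diffspeech_rtf), 2))  # 0.9
```

Because the table rounds RTF to two decimals, the reciprocal of 0.04 gives roughly 25x, which is consistent with the approximately 24x reported in the abstract. By the same arithmetic, an RTF above 1 (DiffSpeech at 128 steps) means synthesis is slower than real time.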

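ProDiff produces high-quality mel-spectrograms with only 2 diffusion steps.
The generator-based parameterization is more robust than the gradient-based parameterization when sampling is accelerated to low step counts.
Knowledge distillation from a 4-step teacher to a 2-step student reduces variance and speeds up sampling by orders of magnitude.
On LJSpeech, ProDiff with 2 steps matches or exceeds several baselines in perceptual quality and diversity, and samples about 24x faster than real time on a single 2080Ti GPU.
ProDiff maintains sample quality and diversity competitive with state-of-the-art models that use hundreds of steps, and extends to the multi-speaker setting.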
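
The teacher-to-student distillation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' training code: the teacher is any callable that predicts x0, sampling uses deterministic DDIM (eta = 0), the student schedule is taken as every second teacher noise level, and the L1 reconstruction loss and all names are hypothetical choices.

```python
import numpy as np

def ddim_sample(predict_x0, x_T, alpha_bars):
    """Deterministic DDIM (eta = 0) with a model that predicts clean data.

    alpha_bars[i] is the cumulative signal level at step i (decreasing in i).
    """
    x = x_T
    for t in reversed(range(len(alpha_bars))):
        x0_hat = predict_x0(x, t)
        if t == 0:
            return x0_hat
        # Recover the implied noise, then move to the previous noise level.
        eps_hat = (x - np.sqrt(alpha_bars[t]) * x0_hat) / np.sqrt(1.0 - alpha_bars[t])
        x = np.sqrt(alpha_bars[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bars[t - 1]) * eps_hat
    return x

def distillation_pair(teacher_predict_x0, shape, teacher_alpha_bars, rng):
    """Build one (noisy input, step, target) example for a half-step student."""
    x_T = rng.standard_normal(shape)
    # The teacher's generated mel-spectrogram is the student's training target.
    target = ddim_sample(teacher_predict_x0, x_T, teacher_alpha_bars)
    # Assumption: the student keeps every second teacher noise level (N -> N/2).
    student_alpha_bars = teacher_alpha_bars[1::2]
    t = int(rng.integers(len(student_alpha_bars)))
    noise = rng.standard_normal(shape)
    x_t = np.sqrt(student_alpha_bars[t]) * target + np.sqrt(1.0 - student_alpha_bars[t]) * noise
    return x_t, t, target, student_alpha_bars

def l1_loss(prediction, target):
    """Mean absolute error, used here as an assumed reconstruction loss."""
    return float(np.mean(np.abs(prediction - target)))
```

A training loop would call `distillation_pair` with a 4-step teacher schedule, feed `x_t` and `t` to the 2-step student, and minimize `l1_loss(student_output, target)` together with any auxiliary terms. Training on the teacher's output rather than on ground-truth spectrograms is what the review refers to as reducing variance on the target side.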

This review was generated by AI and checked by a human editor.