QUICK REVIEW

[Paper Review] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

Jaehyeon Kim, Jungil Kong|arXiv (Cornell University)|Jun 11, 2021

Speech Recognition and Synthesis|References: 38|Citations: 121

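One-Sentence Summary

TL;DR: VITS is a parallel end-to-end TTS model that combines a conditional VAE, normalizing flows, and adversarial training to generate natural-sounding speech. Together with a stochastic duration predictor that produces diverse rhythms, it achieves a MOS close to ground truth on LJ Speech and strong multi-speaker performance on VCTK.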

ABSTRACT

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Motivation and Objectives

Overcome the limitations of two-stage TTS pipelines by enabling end-to-end training with latent variable modeling.
Increase expressive power of the prior and posterior distributions via normalizing flows.
Model and utilize a stochastic duration predictor to capture diverse rhythms in speech.
Leverage adversarial training to enhance waveform realism beyond mel-spectrogram reconstructions.
Demonstrate superior quality and multi-speaker capabilities compared to public two-stage systems.

Proposed Method

Formulates TTS as a conditional VAE with prior p(z|c) enhanced by a normalizing flow f_theta for expressive latent space.
Uses a posterior encoder q_phi(z|x_lin) that takes the linear-scale spectrogram x_lin as input, and computes an L1 reconstruction loss in the mel-spectrogram domain.
Estimates text–speech alignment A via Monotonic Alignment Search (MAS) adapted to maximize ELBO.
Introduces a stochastic duration predictor based on variational dequantization and variational data augmentation to model speech rhythm.
Incorporates adversarial training with a HiFi-GAN-like decoder and a discriminator D, plus a feature-matching loss for stable, high-quality waveform generation.
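Uses windowed generator training: the decoder is trained on randomly extracted segments of the latent sequence rather than whole utterances, which reduces training time and memory use while the model still generates full waveforms end-to-end at inference.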
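
The alignment step in the list above can be illustrated with a small dynamic program. The following is a minimal NumPy sketch of a monotonic alignment search, not the authors' implementation. It assumes a precomputed matrix `log_p` whose entry `[i, j]` is the log-likelihood of latent frame `j` under the prior statistics of text token `i`; the function name and the input layout are hypothetical choices made for illustration.

```python
import numpy as np


def monotonic_alignment_search(log_p):
    """Return a 0/1 matrix aligning each frame to exactly one text token.

    log_p[i, j] is assumed to hold the log-likelihood of frame j under
    text token i. The alignment is monotonic and skips no token.
    """
    n_text, n_frames = log_p.shape
    if n_text > n_frames:
        raise ValueError("need at least one frame per text token")

    # q[i, j]: best cumulative log-likelihood of a path ending at (i, j).
    q = np.full((n_text, n_frames), -np.inf)
    q[0, 0] = log_p[0, 0]
    for j in range(1, n_frames):
        # Token i cannot be reached before frame i, so limit i to <= j.
        for i in range(min(j + 1, n_text)):
            stay = q[i, j - 1]                                # same token
            advance = q[i - 1, j - 1] if i > 0 else -np.inf   # next token
            q[i, j] = log_p[i, j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    path = np.zeros((n_text, n_frames), dtype=np.int64)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[i, j] = 1
        if j > 0 and i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return path
```

Summing each row of the returned matrix gives a per-token duration in frames, which is the kind of target a duration predictor can be trained on. The double loop is written for readability; a practical version would likely be vectorized or compiled.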

Experimental Results

Research Questions

RQ1: Can a conditional VAE with a flow-based prior produce high-quality end-to-end waveform synthesis without intermediate representations?
RQ2: Does MAS-based alignment estimation integrated into ELBO optimization yield better alignments for text-to-speech?
RQ3: Can a stochastic duration predictor deliver diverse rhythms in parallel TTS while maintaining naturalness?
RQ4: What is the impact of adversarial training and feature matching on end-to-end TTS synthesis quality?
RQ5: How well does the proposed end-to-end model generalize to multi-speaker corpora?

Key Findings

Model	MOS (CI)
Ground Truth	4.46 (±0.06)
Tacotron 2 + HiFi-GAN	3.77 (±0.08)
Tacotron 2 + HiFi-GAN (Fine-tuned)	4.25 (±0.07)
Glow-TTS + HiFi-GAN	4.14 (±0.07)
Glow-TTS + HiFi-GAN (Fine-tuned)	4.32 (±0.07)
VITS (DDP)	4.39 (±0.06)
VITS	4.43 (±0.06)

CI: 95% confidence interval. DDP: VITS with a deterministic duration predictor.
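
For readers unfamiliar with the "MOS (CI)" notation, the sketch below shows one common way to turn listener ratings into a mean and a confidence-interval half-width. It is an assumption-laden illustration: it uses a normal approximation with z = 1.96, the `ratings` list is made up, and the procedure the authors actually used to compute their intervals is not described in this review.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Uses a normal approximation: half_width = z * s / sqrt(n), where s is
    the sample standard deviation. z = 1.96 corresponds to roughly 95%.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings, for illustration only (not data from the paper).
ratings = [4, 5, 4, 5, 4, 5, 4, 5]
mean, half_width = mos_with_ci(ratings)
print(f"MOS {mean:.2f} (±{half_width:.2f})")
```

With few ratings the normal approximation is loose; a t-distribution quantile would give a wider and more defensible interval.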

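VITS achieves a MOS comparable to ground truth on LJ Speech and outperforms the publicly available two-stage systems.
The normalizing flow in the prior encoder improves MOS substantially: removing it lowers MOS by 1.52.
Using a linear-scale spectrogram as the posterior encoder input yields higher quality than using a mel-spectrogram input.
On VCTK, VITS outperforms the Tacotron 2 + HiFi-GAN and Glow-TTS + HiFi-GAN baselines, demonstrating effective multi-speaker modeling.
The stochastic duration predictor diversifies phoneme durations, and together with sampling over the latent variables it yields varied pitch and speech rhythm while maintaining quality.
VITS synthesizes end-to-end faster than Glow-TTS + HiFi-GAN and runs at real-time speed or faster on a GPU.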

This review was generated by AI and checked by a human editor.