QUICK REVIEW

[Paper Review] CHiVE: Varying Prosody in Speech Synthesis with a Linguistically Driven Dynamic Hierarchical Conditional Variational Network

Vincent Wan, Chun-an Chan|arXiv (Cornell University)|May 17, 2019

Speech Recognition and Synthesis | References: 29 | Citations: 51

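One-Line Summary

tldr: CHiVE introduces a linguistically driven, dynamic hierarchical conditional VAE that generates varied prosodic features and enables prosody transfer between sentences. It improves naturalness over a non-hierarchical baseline.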

ABSTRACT

The prosodic aspects of speech signals produced by current text-to-speech systems are typically averaged over training material, and as such lack the variety and liveliness found in natural speech. To avoid monotony and averaged prosody contours, it is desirable to have a way of modeling the variation in the prosodic aspects of speech, so audio signals can be synthesized in multiple ways for a given text. We present a new, hierarchically structured conditional variational autoencoder to generate prosodic features (fundamental frequency, energy and duration) suitable for use with a vocoder or a generative model like WaveNet. At inference time, an embedding representing the prosody of a sentence may be sampled from the variational layer to allow for prosodic variation. To efficiently capture the hierarchical nature of the linguistic input (words, syllables and phones), both the encoder and decoder parts of the auto-encoder are hierarchical, in line with the linguistic structure, with layers being clocked dynamically at the respective rates. We show in our experiments that our dynamic hierarchical network outperforms a non-hierarchical state-of-the-art baseline, and, additionally, that prosody transfer across sentences is possible by employing the prosody embedding of one sentence to generate the speech signal of another.

Motivation and Objectives

Motivate modeling of per-utterance prosody variation to avoid averaging in TTS.
Propose a dynamic clockwork hierarchical VAE that aligns with linguistic structure (words, syllables, phones).
Learn a sentence-level prosody embedding to capture and sample prosodic variation.
Enable prosody transfer from a reference sentence to a different textual content.
Demonstrate that hierarchical structure yields more natural and expressive prosody than a flat baseline.

Proposed Method

Propose CHiVE, a clockwork hierarchical conditional variational auto-encoder with encoder, variational layer, and decoder.
Use hierarchical RNNs at frame/phone/syllable levels in both encoder and decoder to reflect linguistic structure.
Insert a variational layer that outputs mean and variance for a sentence prosody embedding, sampled from a Gaussian.
Condition the decoder on the linguistic features plus a sampled sentence prosody embedding to predict duration, F0/c0, and energy-related features.
Train with L2 losses on duration and F0/c0, plus KL divergence for the variational layer.
During inference, either sample the prosody embedding from the prior or obtain it by encoding a sentence; an encoded embedding can optionally be transferred to another sentence by pairing it with that sentence's linguistic features.
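
The variational layer, the training loss, and the inference-time choice of embedding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the standard-normal prior N(0, I), the log-variance parameterization, the `kl_weight` parameter, and all function names are assumptions introduced here, and the hierarchical RNN encoder and decoder are omitted entirely.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Reparameterized draw: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))


def l2_loss(predicted, target):
    """Sum of squared errors between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def total_loss(pred, target, mean, log_var, kl_weight=1.0):
    """L2 loss over each prosodic stream plus a weighted KL term.

    kl_weight is a hypothetical knob; the review does not state a weighting.
    """
    recon = sum(l2_loss(pred[name], target[name]) for name in target)
    return recon + kl_weight * kl_to_standard_normal(mean, log_var)


def choose_embedding(mode, dim, rng, mean=None):
    """Pick the sentence prosody embedding given to the decoder at inference."""
    if mode == "zero":      # mean of the assumed N(0, I) prior
        return [0.0] * dim
    if mode == "random":    # a sample from the assumed prior
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if mode == "encoded":   # encoder mean of a reference sentence
        return list(mean)
    raise ValueError(f"unknown mode: {mode}")


# Made-up encoder outputs for a 2-dimensional embedding.
rng = random.Random(0)
mean, log_var = [0.3, -0.2], [-1.0, -1.5]
z = sample_embedding(mean, log_var, rng)
```

Under this sketch, prosody transfer amounts to calling `choose_embedding("encoded", ...)` with the encoder mean of the reference sentence and passing the result to the decoder together with the linguistic features of the target sentence.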

Experimental Results

Research Questions

RQ1: Can a dynamic hierarchical VAE capture meaningful per-utterance prosodic variation for TTS?
RQ2: Does a linguistically driven clockwork hierarchy improve prosody modeling over a non-hierarchical baseline?
RQ3: Is it possible to transfer prosody from one sentence to another using the CHiVE latent space?
RQ4: What is the impact of embedding type (zero vs encoded vs random) on prosody quality and naturalness?

Key Findings

CHiVE’s dynamic hierarchical model is significantly preferred over a non-hierarchical baseline in AB side-by-side evaluations (baseline preferred 292, CHiVE preferred 438; p = 3.91e-8).
MOS tests show CHiVE achieves higher naturalness than the baseline, with scores: Baseline 4.01±0.11, CHiVE zero embedding 4.07±0.10, CHiVE encoded 4.25±0.10, real speech 4.67±0.07.
Using encoder mean embeddings reduces log F0 RMSE by 21% relative to the baseline when evaluated on held-out data.
Prosody transfer is demonstrated by conditioning the decoder on another sentence’s prosody embedding, producing transfer-like variations in log F0 contours.
The zero embedding yields reasonable prosody that is less expressive than with the encoded embedding, while random embeddings tend to produce more varied but less accurate F0 contours.
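
As a rough check on the AB preference counts listed above (292 vs 438), an exact sign test can be computed with the standard library. This is only a sketch: the review does not say which statistical test produced the reported p-value or whether "no preference" ratings were excluded from the counts, so the choice of a binomial test with a null probability of 0.5 is an assumption, and the result need not match the reported figure exactly.

```python
from math import comb


def binomial_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), using exact integer arithmetic."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


baseline_preferred = 292
chive_preferred = 438
n = baseline_preferred + chive_preferred

# Share of decisive ratings that favoured CHiVE.
chive_share = chive_preferred / n

# Null hypothesis: listeners are equally likely to prefer either system.
p_one_sided = binomial_tail(chive_preferred, n)
p_two_sided = min(1.0, 2.0 * p_one_sided)

print(f"CHiVE share of preferences: {chive_share:.3f}")
print(f"one-sided p = {p_one_sided:.3g}, two-sided p = {p_two_sided:.3g}")
```

Both p-values fall far below conventional significance thresholds, which is consistent with the preference being described as significant.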


This review was generated by AI and checked by a human editor.