QUICK REVIEW

[Paper Review] Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

Gustav Eje Henter, Jaime Lorenzo-Trueba|arXiv (Cornell University)|Jul 30, 2018

Speech Recognition and Synthesis | References: 79 | Citations: 51

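One-Line Summary

This paper studies how to learn controllable output in speech synthesis without supervision, using encoder-decoder and variational autoencoder frameworks, and connects existing training heuristics to probabilistic latent-variable models and VQ-VAEs. On emotional speech synthesis, it shows that these unsupervised approaches match or surpass the previous best supervised approach in many aspects.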

ABSTRACT

Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example, we show that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models. We additionally connect these models to VQ-VAEs, another, recently-proposed class of deep variational autoencoders, which we show can be derived from a very similar mathematical argument. The implications of these new probabilistic interpretations are discussed. We illustrate the utility of the various approaches with an application to acoustic modelling for emotional speech synthesis, where the unsupervised methods for learning expression control (without access to emotional labels) are found to give results that in many aspects match or surpass the previous best supervised approach.

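Research Motivation and Objectives

Motivate controllable speech synthesis that goes beyond text annotation by learning from unannotated variation.
Establish a probabilistic interpretation of existing unsupervised control methods.
Connect popular training heuristics to variational autoencoders and VQ-VAEs.
Evaluate unsupervised control methods against a supervised baseline on a large emotional speech database.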

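Proposed Method

Formulate the control problem as a latent-variable model of speech synthesis with text input.
Derive a lower bound using variational inference and interpret the training heuristics as approximate maximum likelihood.
Show the equivalence of, or connection between, DCC-like control and the VQ-VAE framework.
Discuss how prior information can be incorporated into unsupervised control methods.
Evaluate empirically on emotional speech and compare with a supervised system.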
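
The lower bound mentioned above can be sketched in its standard textbook form. The notation here is an assumption for illustration and is not taken from the paper: x denotes the acoustic features, t the text-derived input, z the latent control variable, p_θ the decoder (synthesiser), q_φ the encoder (approximate posterior), and the prior p(z) is assumed not to depend on the text.

```latex
\begin{align}
\log p_\theta(x \mid t)
  &= \log \int p_\theta(x \mid z, t)\, p(z)\, \mathrm{d}z \\
  &\geq \mathbb{E}_{q_\phi(z \mid x, t)}\!\left[ \log p_\theta(x \mid z, t) \right]
     - \mathrm{KL}\!\left( q_\phi(z \mid x, t) \,\|\, p(z) \right)
\end{align}
```

Maximising the right-hand side jointly over θ and φ is what makes such training an approximate form of maximum likelihood: the gap between the two sides is the KL divergence from q_φ(z | x, t) to the true posterior over z.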

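Experimental Results

Research Questions

RQ1: Can latent control variables be learned without supervision, and without emotion labels, so as to generate controllable speech?
RQ2: How do existing unsupervised control heuristics relate to the principles of variational inference and VQ-VAEs?
RQ3: In expressive (emotional) speech synthesis, do unsupervised approaches match or surpass supervised models?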

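Key Findings

Unsupervised control methods can be interpreted as approximate maximum-likelihood estimators via a variational lower bound.
There is a theoretical connection between popular encoder-decoder methods and VQ-VAEs.
Prior information can be incorporated into the heuristic unsupervised methods.
In experiments on acoustic modelling with a large emotional speech database, the unsupervised methods matched or surpassed the previous best supervised approach in many aspects.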
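
To make the VQ-VAE connection more concrete, the following is a minimal sketch of the vector-quantisation step that VQ-VAEs use: an encoder output is replaced by its nearest codebook vector, and the index of that vector acts as a discrete control code. This is a generic illustration, not the paper's implementation; the function names, the toy codebook, and the example vector are all hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    z: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and a copy of the codebook vector nearest to z."""
    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(z, codebook[k]),
    )
    return index, list(codebook[index])


# Hypothetical toy codebook with three 2-D control vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoder output for one utterance.
index, z_q = quantize([0.9, 1.2], codebook)
```

At synthesis time, choosing a codebook index by hand would play the role of selecting an expression, which is the sense in which a discrete latent variable provides output control.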


This review was generated by AI and checked by a human editor.