QUICK REVIEW

[Paper Review] Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

Takuhiro Kaneko, Hirokazu Kameoka|arXiv (Cornell University)|Nov 30, 2017

Speech Recognition and Synthesis|References: 32|Citations: 179

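One-Line Summary

This paper proposes CycleGAN-VC, a voice conversion method that uses a CycleGAN to learn a mapping from source speech to target speech without parallel data. It combines gated CNNs with an identity-mapping loss and reduces over-smoothing.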

ABSTRACT

We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss. A CycleGAN learns forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. This makes it possible to find an optimal pseudo pair from unpaired data. Furthermore, the adversarial loss contributes to reducing over-smoothing of the converted feature sequence. We configure a CycleGAN with gated CNNs and train it with an identity-mapping loss. This allows the mapping function to capture sequential and hierarchical structures while preserving linguistic information. We evaluated our method on a parallel-data-free VC task. An objective evaluation showed that the converted feature sequence was near natural in terms of global variance and modulation spectra. A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions with parallel and twice the amount of data.

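Motivation and Objectives

Address the need for voice conversion that requires neither parallel data nor an additional alignment module.
Develop a general-purpose, high-quality voice conversion method that avoids the over-smoothing inherent in conventional methods.
Use a CycleGAN to learn forward and inverse mappings from unpaired data while preserving linguistic information.
Show on VCC 2016 that near-natural feature conversion is achievable without parallel data.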

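Proposed Method

Uses a CycleGAN whose forward and inverse mappings (G_X->Y and G_Y->X) are trained with adversarial and cycle-consistency losses.
Incorporates gated CNNs (GLU activations) to capture the sequential and hierarchical structure of speech.
Adds an identity-mapping loss to preserve linguistic information; the cycle-consistency and identity terms use the L1 loss.
Trains with a least-squares GAN objective to stabilize learning.
Represents source and target speech with 24 Mel-cepstral coefficients (MCEPs), log F0, and aperiodicities (APs); the model converts the MCEPs, and F0 is also converted appropriately.
Extracts features with the WORLD vocoder and randomly crops segments to increase batch diversity.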
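
The loss terms listed above can be sketched numerically as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the toy generators and discriminators, and the weights `lambda_cyc=10.0` and `lambda_id=5.0` are placeholders chosen for the example and do not come from this review.

```python
import numpy as np


def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push D(real) toward 1 and D(fake) toward 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))


def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push D(G(x)) toward 1."""
    return float(np.mean((d_fake - 1.0) ** 2))


def l1(a, b):
    """Mean absolute error, used for the cycle-consistency and identity terms."""
    return float(np.mean(np.abs(a - b)))


def generator_objective(x, y, g_xy, g_yx, d_x, d_y,
                        lambda_cyc=10.0, lambda_id=5.0):
    """Combine adversarial, cycle-consistency, and identity-mapping losses.

    x, y      : feature sequences from the source and target speakers
    g_xy, g_yx: forward and inverse mapping functions (hypothetical callables)
    d_x, d_y  : discriminators for each domain (hypothetical callables)
    lambda_*  : assumed trade-off weights
    """
    fake_y = g_xy(x)
    fake_x = g_yx(y)
    adv = lsgan_g_loss(d_y(fake_y)) + lsgan_g_loss(d_x(fake_x))
    # Converting forward and then back should recover the input.
    cyc = l1(g_yx(fake_y), x) + l1(g_xy(fake_x), y)
    # A sample already in the output domain should pass through unchanged.
    idt = l1(g_xy(y), y) + l1(g_yx(x), x)
    total = adv + lambda_cyc * cyc + lambda_id * idt
    return {"adv": adv, "cyc": cyc, "id": idt, "total": total}


# Toy usage: 24-dimensional features over 128 frames, with stand-in mappings.
rng = np.random.default_rng(0)
x = rng.standard_normal((24, 128))
y = rng.standard_normal((24, 128))
losses = generator_objective(
    x, y,
    g_xy=lambda a: a + 1.0,          # stand-in for the forward generator
    g_yx=lambda a: a,                # stand-in for the inverse generator
    d_x=lambda a: np.zeros(1),       # stand-in discriminators that reject everything
    d_y=lambda a: np.zeros(1),
)
```

In a real system the generators and discriminators would be neural networks updated alternately; the sketch only shows how the three terms are combined into one generator objective.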

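Experimental Results

Research Questions

RQ1: Can a CycleGAN-based model learn a source-to-target voice mapping without parallel data?
RQ2: Does incorporating gated CNNs and an identity-mapping loss suppress over-smoothing while preserving linguistic information?
RQ3: When data conditions are constrained, how does parallel-data-free CycleGAN-VC compare with GMM-based VC?
RQ4: What do the objective metrics, global variance (GV) and modulation spectrum (MS), and the subjective MOS indicate about the quality of the converted MCEPs?
RQ5: Does CycleGAN-VC remain competitive under the non-ideal condition of half the data and no parallel data?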

Key Findings

Method	SF1–TF2	SF1–TM3	SM1–TF2	SM1–TM3
CycleGAN-VC w/ GLU	1.98	2.69	1.93	2.14
CycleGAN-VC w/o GLU	3.34	2.99	3.17	2.94
GMM-VC w/ GV	7.59	9.41	8.69	9.67
GMM-VC w/o GV	13.56	14.90	14.17	14.53
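
The objective metrics named in this review, global variance (GV) and the modulation spectrum (MS), can be sketched as below. The table above appears to report an RMSE between target and converted log MS, but the exact definition (FFT length, scaling, averaging over utterances) is not given here, so `n_fft`, `eps`, and all function names are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np


def global_variance(mcep):
    """Per-dimension variance over time for an MCEP sequence of shape (frames, dims)."""
    return np.var(mcep, axis=0)


def log_modulation_spectrum(mcep, n_fft=256, eps=1e-10):
    """Log power spectrum of each MCEP dimension's temporal trajectory.

    Returns an array of shape (n_fft // 2 + 1, dims).
    """
    spectrum = np.fft.rfft(mcep, n=n_fft, axis=0)
    return np.log(np.abs(spectrum) ** 2 + eps)


def log_ms_rmse(target, converted, n_fft=256):
    """RMSE between two log modulation spectra, averaged over
    all MCEP dimensions and modulation frequencies."""
    diff = (log_modulation_spectrum(target, n_fft)
            - log_modulation_spectrum(converted, n_fft))
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy usage: an over-smoothed sequence is mimicked by shrinking the target.
rng = np.random.default_rng(0)
target = rng.standard_normal((128, 24))   # 128 frames, 24 MCEP dimensions
smoothed = 0.5 * target
gv_ratio = global_variance(smoothed) / global_variance(target)
rmse = log_ms_rmse(target, smoothed)
```

Smaller RMSE values mean the converted trajectories fluctuate more like the target ones; a shrunken GV is the usual symptom of over-smoothing.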

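CycleGAN-VC with GLU yields MCEP sequences closest to the target in terms of GV and MS, compared with the ablation without GLU and the GMM-VC baselines.
The objective RMSE of the log MS shows that CycleGAN-VC with GLU outperforms the variant without GLU across speaker pairs.
Subjective MOS results show that CycleGAN-VC, trained without parallel data, surpasses the VCC 2016 baseline in naturalness.
CycleGAN-VC is comparable to a GMM-based method trained on parallel data with twice the amount of data, even though it is trained on non-parallel data and a smaller dataset.
The adversarial loss reduces over-smoothing, and GLU activations are advantageous for modeling sequential structure.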
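
The GLU gating mentioned above can be illustrated with a small NumPy sketch of one gated 1D convolution layer: a linear convolution path is multiplied elementwise by a sigmoid gate computed from a second convolution. The layer sizes, the zero-padding scheme, and the function names are assumptions for illustration; the actual network architecture is not described in this review.

```python
import numpy as np


def conv1d_same(x, w, b):
    """1D convolution over time with zero padding so the length is preserved.

    x: (in_ch, T), w: (out_ch, in_ch, k) with odd k, b: (out_ch,) -> (out_ch, T)
    """
    out_ch, in_ch, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.empty((out_ch, T))
    for t in range(T):
        window = xp[:, t:t + k]  # (in_ch, k) slice centered on frame t
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_conv1d(x, w_lin, b_lin, w_gate, b_gate):
    """Gated linear unit: linear path scaled by a data-driven gate in (0, 1)."""
    linear = conv1d_same(x, w_lin, b_lin)
    gate = sigmoid(conv1d_same(x, w_gate, b_gate))
    return linear * gate


# Toy usage: 24 input channels (MCEP dimensions), 16 frames, 8 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((24, 16))
w_lin = rng.standard_normal((8, 24, 5)) * 0.1
w_gate = rng.standard_normal((8, 24, 5)) * 0.1
h = gated_conv1d(x, w_lin, np.zeros(8), w_gate, np.zeros(8))
```

Because the gate depends on the input, each layer can decide per frame and per channel how much information to pass on, which is the property the review credits for better modeling of sequential structure.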

This review was generated by AI and checked by a human editor.