QUICK REVIEW

[Paper Review] Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

Takuhiro Kaneko, Hirokazu Kameoka | arXiv (Cornell University) | Nov 30, 2017

Speech Recognition and Synthesis | References: 32 | Cited by: 179

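ONE-SENTENCE SUMMARY

This paper proposes CycleGAN-VC, a voice-conversion method that learns a mapping from source to target speech without parallel data by using a cycle-consistent GAN with gated CNNs and an identity-mapping loss, and that reduces over-smoothing.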

ABSTRACT

We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) with gated convolutional neural networks (CNNs) and an identity-mapping loss. A CycleGAN learns forward and inverse mappings simultaneously using adversarial and cycle-consistency losses. This makes it possible to find an optimal pseudo pair from unpaired data. Furthermore, the adversarial loss contributes to reducing over-smoothing of the converted feature sequence. We configure a CycleGAN with gated CNNs and train it with an identity-mapping loss. This allows the mapping function to capture sequential and hierarchical structures while preserving linguistic information. We evaluated our method on a parallel-data-free VC task. An objective evaluation showed that the converted feature sequence was near natural in terms of global variance and modulation spectra. A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method under advantageous conditions with parallel and twice the amount of data.

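MOTIVATION AND OBJECTIVES

Address the need for voice conversion that works without parallel data or an extra alignment module.
Develop a general-purpose, high-quality voice-conversion method that avoids the over-smoothing common in conventional statistical model-based methods.
Use a CycleGAN to learn forward and inverse mappings from unpaired data while preserving linguistic information.
Show that CycleGAN-VC achieves near-natural feature conversion on the Voice Conversion Challenge 2016 (VCC 2016) data without parallel data.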

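PROPOSED METHOD

Use a CycleGAN with forward and inverse mappings (G_X->Y and G_Y->X), trained with adversarial and cycle-consistency losses.
Introduce gated CNNs (with gated linear unit, GLU, activations) to capture the sequential and hierarchical structure of speech.
Add an identity-mapping loss to preserve linguistic information, and use the L1 distance for both the cycle-consistency and identity terms.
Adopt a least-squares GAN objective to stabilize training.
Represent source and target speech with 24 Mel-cepstral coefficients (MCEPs), log F0, and aperiodicities (APs); perform the conversion in the MCEP domain and transform F0 accordingly.
Use features extracted with the WORLD vocoder, and randomly crop segments to increase the diversity of each batch.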
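
The training objective described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy generators and discriminators, and the weights `lambda_cyc` and `lambda_id` are assumptions introduced here, and a real model would use gated convolutional networks trained by gradient descent.

```python
import numpy as np

def glu(linear_out, gate_out):
    """Gated linear unit: the linear path is scaled by a sigmoid gate."""
    return linear_out * (1.0 / (1.0 + np.exp(-gate_out)))

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: scores on converted features toward 1."""
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def l1(a, b):
    """Mean absolute difference, used for the cycle and identity terms."""
    return np.mean(np.abs(a - b))

def generator_objective(x, y, g_xy, g_yx, d_x, d_y,
                        lambda_cyc=10.0, lambda_id=5.0):
    """Adversarial + cycle-consistency + identity-mapping loss for both generators.

    lambda_cyc and lambda_id are placeholder weights (assumed, not from this review).
    """
    fake_y = g_xy(x)  # source -> target
    fake_x = g_yx(y)  # target -> source
    adv = lsgan_g_loss(d_y(fake_y)) + lsgan_g_loss(d_x(fake_x))
    cyc = l1(g_yx(fake_y), x) + l1(g_xy(fake_x), y)   # x -> y -> x and y -> x -> y
    idt = l1(g_xy(y), y) + l1(g_yx(x), x)             # target input should pass through
    return adv + lambda_cyc * cyc + lambda_id * idt

# Toy usage with hypothetical stand-ins for the networks.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 24))  # source MCEP segment, shape (frames, dims)
y = rng.standard_normal((128, 24))  # unpaired target MCEP segment
identity = lambda feats: feats                    # toy generator
always_real = lambda feats: np.ones(len(feats))   # toy discriminator scoring 1
total = generator_objective(x, y, identity, identity, always_real, always_real)
# With these stand-ins all three terms vanish, so total is 0.0.
```

The identity term is what discourages a generator from altering input that already belongs to its target domain, which is how the list above ties it to preserving linguistic information.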

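EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a CycleGAN-based model learn a source-to-target speech mapping without parallel data?
RQ2: Does combining gated CNNs with an identity-mapping loss reduce over-smoothing while preserving linguistic information?
RQ3: How does parallel-data-free CycleGAN-VC compare with GMM-based voice conversion when its data conditions are constrained?
RQ4: Which objective metrics (global variance, GV; modulation spectra, MS) and subjective mean opinion scores (MOS) indicate the quality of the converted MCEPs?
RQ5: Is CycleGAN-VC competitive under disadvantageous data conditions, with half the amount of data and no parallel data?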

Key Findings

On the GV and MS metrics, CycleGAN-VC with GLU produces the MCEP sequences closest to the target, compared with the ablated variants and the GMM-VC baseline.
The objective RMSE of the log MS shows that CycleGAN-VC with GLU outperforms the variant without GLU across speaker pairs.
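Subjective MOS results indicate that CycleGAN-VC, trained without parallel data, outperforms the VCC 2016 baseline in naturalness.
CycleGAN-VC is comparable to a GMM-based method trained on parallel data with twice the amount of data, even though its own training data is non-parallel and smaller.
The method reduces over-smoothing through the adversarial loss and benefits from the strength of GLU activations in modeling sequential structure.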
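
The two objective metrics named in the findings above can be computed as follows. This is a minimal NumPy sketch, assuming MCEPs are stored as a (frames, dims) array; the function names, the FFT length, and the moving-average signal used to imitate over-smoothing are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def global_variance(mcep):
    """Per-dimension variance over time for a (frames, dims) MCEP array."""
    return np.var(mcep, axis=0)

def log_modulation_spectrum(mcep, n_fft=256, eps=1e-10):
    """Log power spectrum of each MCEP trajectory along the time axis."""
    spec = np.fft.rfft(mcep, n=n_fft, axis=0)
    return np.log(np.abs(spec) ** 2 + eps)

def log_ms_rmse(target, converted, n_fft=256):
    """RMSE between the log modulation spectra of two MCEP sequences."""
    diff = (log_modulation_spectrum(target, n_fft)
            - log_modulation_spectrum(converted, n_fft))
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy demonstration: temporal smoothing lowers GV and distorts the MS.
rng = np.random.default_rng(0)
natural = rng.standard_normal((200, 24))   # stand-in for natural MCEPs
kernel = np.ones(5) / 5.0                  # 5-frame moving average
smoothed = np.apply_along_axis(
    lambda traj: np.convolve(traj, kernel, mode="same"), 0, natural)

gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)    # smaller in every dimension here
ms_gap = log_ms_rmse(natural, smoothed)    # positive: high modulation frequencies are attenuated
```

An over-smoothed sequence shows up as a GV below the natural one and as a log MS that falls off at higher modulation frequencies, which is why both metrics serve as indicators of how natural the converted features are.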

This review was generated by AI and reviewed by a human editor.