QUICK REVIEW

[Paper Review] Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

Yi Zhao, Wen-Chin Huang|arXiv (Cornell University)|Aug 28, 2020

Speech Recognition and Synthesis|Cited by 34

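ONE-SENTENCE SUMMARY

This paper reports on VCC 2020: its two tasks (intra-lingual semi-parallel and cross-lingual VC), a new multilingual dataset, the submitted systems, and crowd-sourced listening tests that show rapid progress in VC, although no system reached human-level naturalness and cross-lingual conversion scored lower than intra-lingual conversion.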

ABSTRACT

The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods. In particular, speaker similarity scores of several systems turned out to be as high as target speakers in the intra-lingual semi-parallel VC task. However, we confirmed that none of them have achieved human-level naturalness yet for the same task. The cross-lingual conversion task is, as expected, a more difficult task, and the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task. However, we observed encouraging results, and the MOS scores of the best systems were higher than 4.0. We also show a few additional analysis results to aid in understanding cross-lingual VC better.

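MOTIVATION AND OBJECTIVES

Provide a common dataset and tasks for comparing voice conversion (VC) methods in intra-lingual semi-parallel and cross-lingual settings.
Measure the progress of VC systems through crowd-sourced listening tests of naturalness and speaker similarity.
Understand how language differences affect the evaluation of VC performance.
Document the system architectures and waveform generation methods used by participants.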

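PROPOSED METHOD

Build two VC tasks (intra-lingual semi-parallel and cross-lingual) on the EMIME multilingual corpus.
Release training and evaluation data and collect participant submissions (33 systems, including 3 baselines).
Group the feature conversion models into encoder-decoder, GAN-based, and parallel spectral mapping approaches, and analyze how each is used across the tasks.
Evaluate converted speech with subjective listening tests: MOS-based naturalness ratings and same/different speaker similarity judgments; compare the vocoders used for waveform generation (neural versus traditional).
Provide detailed descriptions of the baseline systems and of representative systems (such as T10) to support the analysis.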
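The MOS-based naturalness evaluation above comes down to averaging listener ratings per system and reporting how uncertain that average is. The following is a minimal sketch of that arithmetic, not the organizers' analysis code. It assumes a 1-to-5 opinion scale and a normal-approximation 95% confidence interval; the function name `mos_with_ci` and the sample ratings are hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of opinion scores.

    Assumption: ratings are on a 1-5 scale and the interval uses a
    normal approximation (z = 1.96 for roughly 95% coverage).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(ratings)
    half_width = z * sd / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings for one system; not data from the challenge.
example = [4, 5, 3, 4, 4, 5, 4, 3]
mean, half_width = mos_with_ci(example)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings the interval is wide, which is one reason listening tests of this kind collect many ratings per system before ranking them.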

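EXPERIMENTAL RESULTS

Research Questions

RQ1: How do VC systems perform in the intra-lingual semi-parallel and cross-lingual settings when built on the same dataset?
RQ2: Which architectures (encoder-decoder, GAN-based, parallel spectral mapping) and vocoders drive the best performance in each task?
RQ3: To what extent do language differences affect the evaluation of naturalness and speaker similarity?
RQ4: What insights can be drawn about the gap between natural speech and converted speech in cross-lingual VC?
RQ5: How do the best-performing systems in these tasks compare with human-level naturalness?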

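Key Findings

Thanks to deep learning, VC methods have progressed rapidly; several intra-lingual semi-parallel systems reached speaker similarity scores as high as those of the target speakers.
No system achieved human-level naturalness in intra-lingual semi-parallel VC.
Cross-lingual VC is the harder task, with lower overall naturalness and similarity scores, yet the best systems scored above 4.0 MOS.
Many submissions used encoder-decoder or GAN-based models for feature conversion, typically trained on non-parallel data; parallel spectral mapping models were less common.
Neural vocoders were widely used for waveform generation, both autoregressive (e.g., WaveNet, WaveRNN, LPCNet) and non-autoregressive (e.g., Parallel WaveGAN, WaveGlow, MelGAN, NSF); traditional waveform generation methods (WORLD, Griffin-Lim) also appeared in some systems.
The evaluation involved both native and non-native listeners, and the cross-lingual task used reference speech in several languages (English, German, Finnish, Mandarin) to reflect practical translation scenarios.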
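Because the evaluation involved both native and non-native listeners, one way to examine the effect of listener language is to average scores separately per system and listener group. The sketch below illustrates that grouping only. The record layout `(system, listener_group, score)`, the group labels, the system names, and the scores are all hypothetical, and none of it comes from the challenge data.

```python
from collections import defaultdict
from statistics import fmean


def mos_by_group(records):
    """Map (system, listener_group) -> mean score.

    Assumption: each record is a (system, listener_group, score) tuple.
    """
    buckets = defaultdict(list)
    for system, group, score in records:
        buckets[(system, group)].append(score)
    return {key: fmean(scores) for key, scores in buckets.items()}


# Hypothetical ratings; system names and scores are made up.
records = [
    ("sysA", "native", 4), ("sysA", "native", 5),
    ("sysA", "non_native", 3), ("sysA", "non_native", 4),
    ("sysB", "native", 2), ("sysB", "non_native", 3),
]
result = mos_by_group(records)
for (system, group), score in sorted(result.items()):
    print(f"{system} {group}: {score:.2f}")
```

Comparing the per-group means for the same system shows whether the two listener populations rate it differently, which is the kind of question RQ3 raises.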


This review was generated by AI and reviewed by a human editor.