QUICK REVIEW

[Paper Review] AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

Yao Shi, Hui Bu|arXiv (Cornell University)|Oct 22, 2020

Speech Recognition and Synthesis|References: 23|Cited by: 82

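One-sentence summary

AISHELL-3 provides a large-scale multi-speaker Mandarin corpus (about 85 hours, 218 speakers) with Chinese-character and pinyin transcripts, plus a baseline multi-speaker TTS system that uses a speaker-embedding feedback constraint to achieve zero-shot voice cloning.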

ABSTRACT

In this paper, we present AISHELL-3, a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Chinese Mandarin speakers. Their auxiliary attributes such as gender, age group and native accents are explicitly marked and provided in the corpus. Accordingly, transcripts in Chinese character-level and pinyin-level are provided along with the recordings. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker speech synthesis system is an extension on Tacotron-2 where a speaker verification model and a corresponding loss regarding voice similarity are incorporated as the feedback constraint. We aim to use the presented corpus to build a robust synthesis model that is able to achieve zero-shot voice cloning. The system trained on this dataset also generalizes well on speakers that are never seen in the training process. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity concerning both speaker embedding similarity and equal error rate measurement. The dataset, baseline system code and generated samples are available online.

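Motivation and objectives

Provide a large, open multi-speaker Mandarin corpus for TTS research.
Enable training of multi-speaker TTS systems with explicitly labeled speaker attributes (gender, age group, accent).
Present a baseline multi-speaker TTS system that uses speaker embeddings and a feedback constraint.
Study data preparation and augmentation strategies that improve model robustness and generalization.
Evaluate speaker similarity and generalization on seen and unseen speakers using objective metrics.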

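Proposed method

Introduce the AISHELL-3 dataset: 85 hours, 218 native Mandarin speakers, 88,035 recordings, with Chinese-character and pinyin transcripts.
Develop a speaker-independent text front end and a speaker-aware acoustic model based on Tacotron-2, with a speaker encoder for voice conditioning.
Introduce a speaker-embedding feedback constraint by adding a cosine similarity loss between the reference and synthesized speaker embeddings.
Use a ResNet-based speaker encoder with global mean-variance pooling to obtain fixed-dimensional speaker embeddings.
Apply data preparation techniques, including prosodic label prediction, silence trimming, and long-sentence concatenation, to improve alignment and generalization.
Train the system and evaluate speaker similarity on both seen and unseen speakers with objective metrics (cosine similarity, SV-EER).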
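
The pooling and feedback-constraint steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names are hypothetical, the loss form `1 - cosine similarity` is an assumption about how a cosine similarity loss might be written, and a real speaker encoder would normally pass the pooled statistics through further layers before producing the embedding. The weight `alpha = 1.0` is the value this review reports; treating it as the weight on the similarity term is also an assumption.

```python
import math


def mean_variance_pool(frames):
    """Collapse T frame vectors (each of length D) into one vector of length 2*D.

    The output is the per-dimension mean followed by the per-dimension
    (population) variance, so its size does not depend on T.
    """
    if not frames:
        raise ValueError("need at least one frame")
    t = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / t for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / t for d in range(dim)]
    return means + variances


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def feedback_loss(reference_embedding, synthesized_embedding):
    """Assumed loss form: 0 when the two embeddings point the same way."""
    return 1.0 - cosine_similarity(reference_embedding, synthesized_embedding)


def total_loss(reconstruction_loss, similarity_loss, alpha=1.0):
    """Hypothetical combination: synthesis loss plus weighted feedback term."""
    return reconstruction_loss + alpha * similarity_loss


# Two utterances of different lengths pool to vectors of the same size.
short_utt = mean_variance_pool([[1.0, 2.0], [3.0, 6.0]])
long_utt = mean_variance_pool([[1.0, 2.0], [3.0, 6.0], [2.0, 4.0]])
print(len(short_utt), len(long_utt))
print(feedback_loss(short_utt, long_utt))
```

Because the pooled vector always has length 2*D, utterances of any duration map to the same embedding size, which is what lets the reference and synthesized embeddings be compared directly.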

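Experimental results

Research questions

RQ1: Can AISHELL-3 support effective training of multi-speaker Mandarin TTS, including zero-shot voice cloning?
RQ2: How does the speaker-embedding feedback constraint affect speaker similarity and robustness for unseen speakers?
RQ3: Which data preparation and augmentation strategies improve alignment, prosody, and long-sentence synthesis in Mandarin TTS?
RQ4: Compared with English multi-speaker corpora, how well does the baseline system generalize from seen to unseen speakers?
RQ5: Which objective metrics capture speaker similarity and voice identity in synthesized Mandarin speech?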

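Key findings

AISHELL-3 contains 85 hours of Mandarin speech from 218 speakers, annotated with gender, age group, and accent, along with Chinese-character and pinyin transcripts.
The baseline multi-speaker TTS system (Tacotron-2 plus a speaker encoder and embedding feedback) shows high speaker similarity on both seen and unseen speakers, as measured by cosine similarity and SV-EER.
Objective evaluation shows that the model retains speaker similarity when generalizing to unseen speakers, with EER changes consistent with prior work on the English VCTK corpus.
Data augmentation and preprocessing (prosodic labels, silence trimming, long-sentence concatenation) improved training efficiency and alignment during Tacotron-2 optimization.
The model uses a frozen speaker encoder and adds a cosine similarity loss term during training to reinforce voice similarity (alpha = 1.0).
The results are consistent with VCTK-based studies, indicating that the AISHELL-3 baseline supports robust multi-speaker Mandarin synthesis and zero-shot voice cloning.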
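
For readers unfamiliar with the SV-EER metric cited in these findings, the sketch below shows one simple way an equal error rate can be approximated from speaker-verification trial scores. It is an illustration under stated assumptions rather than the paper's evaluation code: the function name and the toy scores are hypothetical, and the threshold sweep over observed scores is a basic approximation (production toolkits usually interpolate along the ROC curve).

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")
    best_gap = None
    eer = None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection rate: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance rate: different-speaker trials at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy cosine-similarity scores (made up for illustration only).
same_speaker = [0.9, 0.8, 0.7, 0.3]
different_speaker = [0.1, 0.2, 0.4, 0.6]
print(equal_error_rate(same_speaker, different_speaker))
```

A lower EER means synthesized speech is harder to tell apart from the reference speaker's real recordings, which is why it complements the raw cosine similarity between embeddings.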


This review was generated by AI and reviewed by a human editor.