QUICK REVIEW

[Paper Review] Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Sercan Ö. Arık, Gregory Diamos|arXiv (Cornell University)|May 24, 2017

Speech Recognition and Synthesis | References: 23 | Cited by: 212

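One-Sentence Summary

The paper introduces trainable low-dimensional speaker embeddings that enable multi-speaker neural TTS within a single shared model, improves on the single-speaker baselines, and produces high-quality, distinguishable voices for hundreds of speakers using Deep Voice 2 and Tacotron paired with a WaveNet vocoder.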

ABSTRACT

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.

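Motivation and Goals

Show that a neural TTS model can learn the voices of many speakers within a single model while reducing the amount of data required per speaker.
Surpass the earlier Deep Voice 1 and Tacotron baselines in single-speaker TTS quality.
Show that trainable speaker embeddings can condition different components of the model to produce different voices.
Extend Deep Voice 2 and Tacotron to the multi-speaker setting and evaluate how distinguishable and how natural the resulting voices are.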

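Proposed Method

Building on Deep Voice 1, develop Deep Voice 2 with improved segmentation, duration, frequency, and vocal models.
Introduce a WaveNet-based spectrogram-to-audio vocoder to replace Griffin-Lim in Tacotron.
Insert low-dimensional trainable speaker embeddings at several points in the model (initialization, input, gating) to enable multi-speaker synthesis.
Apply site-specific speaker embeddings to the segmentation, duration, frequency, and vocal components, using strategies such as initializing recurrent states from the embedding and augmenting layer inputs with it.
For Tacotron, condition the encoder on the speaker embedding and use a WaveNet vocoder for spectrogram-to-audio conversion.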
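
The two conditioning strategies named above, recurrent-state initialization and input augmentation, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name, the dimensions, the tanh projection, and the random (untrained) weights are all assumptions made for illustration, and gating is left out. The idea shown is that one shared low-dimensional vector per speaker is passed through a separate projection for each site where it is used.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    """Site-specific projection tanh(W e). The tanh nonlinearity is an assumption."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in matrix]


class SpeakerConditioner:
    """Hypothetical helper: one low-dimensional embedding per speaker,
    plus one projection per conditioning site."""

    def __init__(self, num_speakers, embed_dim, state_dim, input_site_dim, seed=0):
        rng = random.Random(seed)
        # Embedding table: one trainable vector per speaker (random here).
        self.table = make_matrix(num_speakers, embed_dim, rng)
        # Separate projections so each site gets its own view of the speaker.
        self.w_init = make_matrix(state_dim, embed_dim, rng)
        self.w_input = make_matrix(input_site_dim, embed_dim, rng)

    def initial_state(self, speaker_id):
        """Recurrent initialization: use the projected embedding as the
        initial hidden state of a recurrent layer."""
        return project(self.w_init, self.table[speaker_id])

    def augment_inputs(self, speaker_id, frames):
        """Input augmentation: concatenate the projected embedding to the
        input at every timestep."""
        site = project(self.w_input, self.table[speaker_id])
        return [list(frame) + site for frame in frames]


cond = SpeakerConditioner(num_speakers=4, embed_dim=8, state_dim=5, input_site_dim=3)
h0 = cond.initial_state(1)                              # initial state for speaker 1
aug = cond.augment_inputs(1, [[0.1, 0.2], [0.3, 0.4]])  # 2 frames, each 2 + 3 values
```

In a trained system the embedding table and the projections would be learned jointly with the rest of the model, so that switching `speaker_id` changes the voice while all other weights stay shared.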

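Experimental Results

Research Questions

RQ1 Can a single neural TTS model using low-dimensional speaker embeddings generate high-quality speech for hundreds of speakers?
RQ2 What trade-offs between data efficiency and quality arise from multi-speaker training on datasets such as VCTK and audiobooks?
RQ3 How do speaker embeddings affect the segmentation, duration, frequency, and vocoder paths in preserving speaker identity?
RQ4 Does replacing Griffin-Lim with a WaveNet vocoder improve the perceived audio quality of single-speaker and multi-speaker TTS?
RQ5 Across different speaker populations, how distinguishable are synthesized voices compared with real speech?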

Key Findings

Dataset	Multi-Speaker Model	Sampling Freq.	MOS	Speaker Classification Acc.
VCTK	Deep Voice 2 (20-layer WaveNet)	16 kHz	2.87 ± 0.13	99.9%
VCTK	Deep Voice 2 (40-layer WaveNet)	16 kHz	3.21 ± 0.13	100%
VCTK	Deep Voice 2 (60-layer WaveNet)	16 kHz	3.42 ± 0.12	99.7%
VCTK	Deep Voice 2 (80-layer WaveNet)	16 kHz	3.53 ± 0.12	99.9%
VCTK	Tacotron (Griffin-Lim)	24 kHz	1.68 ± 0.12	99.4%
VCTK	Tacotron (20-layer WaveNet)	24 kHz	2.51 ± 0.13	60.9%
VCTK	Ground Truth Data	48 kHz	4.65 ± 0.06	99.7%
Audiobooks	Deep Voice 2 (80-layer WaveNet)	16 kHz	2.97 ± 0.17	97.4%
Audiobooks	Tacotron (Griffin-Lim)	24 kHz	1.73 ± 0.22	93.9%
Audiobooks	Tacotron (20-layer WaveNet)	24 kHz	2.11 ± 0.20	66.5%
Audiobooks	Ground Truth Data	44.1 kHz	4.63 ± 0.04	98.8%

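Deep Voice 2 outperforms Deep Voice 1 in the single-speaker MOS evaluation, a significant quality improvement.
Tacotron with a WaveNet vocoder scores a higher MOS than Tacotron with Griffin-Lim, indicating better audio quality.
A single model can learn hundreds of unique voices from less than half an hour of data per speaker while maintaining high quality and speaker distinguishability.
Multi-speaker Deep Voice 2 reaches speaker classification accuracy comparable to ground truth on both datasets (99.9% vs. 99.7% on VCTK, 97.4% vs. 98.8% on Audiobooks), although its MOS stays below ground truth (3.53 vs. 4.65 and 2.97 vs. 4.63). Multi-speaker Tacotron with a WaveNet vocoder improves MOS over Griffin-Lim, but its classification accuracy drops to 60.9% on VCTK and 66.5% on Audiobooks.
On VCTK, Deep Voice 2 with a 40-layer WaveNet obtains a MOS of 3.21 ± 0.13 with 100% speaker classification accuracy, and with an 80-layer WaveNet a MOS of 3.53 ± 0.12 with 99.9% accuracy; ground-truth data scores a MOS of 4.65 ± 0.06 with 99.7% accuracy.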
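
To make the two reported metrics concrete, here is a minimal sketch of how a "MOS ± interval" figure and a speaker classification accuracy could be computed. It assumes the ± values are confidence intervals around the mean rating and uses a normal approximation with z = 1.96 (a 95% interval); this summary does not state the interval type, so both are assumptions. The ratings and speaker IDs are made-up illustrative values, not data from the paper.

```python
import math
import statistics


def mos_with_interval(ratings, z=1.96):
    """Return (mean rating, half-width of the interval around the mean).

    Uses the sample standard deviation and a normal approximation;
    z = 1.96 corresponds to a 95% interval under that assumption.
    """
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def classification_accuracy(true_ids, predicted_ids):
    """Fraction of utterances whose predicted speaker matches the true one."""
    correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
    return correct / len(true_ids)


# Hypothetical listener ratings on a 1-5 scale for one system.
ratings = [3, 4, 4, 3, 5, 4, 3, 4]
mean, half_width = mos_with_interval(ratings)
print(f"MOS {mean:.2f} ± {half_width:.2f}")

# Hypothetical speaker labels and classifier predictions for four utterances.
print(classification_accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
```

Read this way, two systems whose intervals overlap heavily cannot be confidently ranked by MOS alone, while a low classification accuracy signals that synthesized voices are being confused with one another.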


This review was generated by AI and reviewed by human editors.