QUICK REVIEW

[Paper Digest] High Fidelity Speech Synthesis with Adversarial Networks

Mikołaj Bińkowski, Jeff Donahue|arXiv (Cornell University)|Sep 25, 2019

Speech and Audio Processing | References: 52 | Cited by: 104

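One-Sentence Summary

GAN-TTS synthesises high-fidelity raw audio for text-to-speech using a feed-forward generator and an ensemble of random-window discriminators, reaching a MOS comparable to WaveNet's while generating efficiently in parallel. It also introduces conditional and unconditional evaluation metrics based on DeepSpeech.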

ABSTRACT

Generative adversarial networks have seen rapid development in recent years and have led to remarkable improvements in generative modelling of images. However, their application in the audio domain has received limited attention, and autoregressive models, such as WaveNet, remain the state of the art in generative modelling of audio signals such as human speech. To address this paucity, we introduce GAN-TTS, a Generative Adversarial Network for Text-to-Speech. Our architecture is composed of a conditional feed-forward generator producing raw speech audio, and an ensemble of discriminators which operate on random windows of different sizes. The discriminators analyse the audio both in terms of general realism, as well as how well the audio corresponds to the utterance that should be pronounced. To measure the performance of GAN-TTS, we employ both subjective human evaluation (MOS - Mean Opinion Score), as well as novel quantitative metrics (Fréchet DeepSpeech Distance and Kernel DeepSpeech Distance), which we find to be well correlated with MOS. We show that GAN-TTS is capable of generating high-fidelity speech with naturalness comparable to the state-of-the-art models, and unlike autoregressive models, it is highly parallelisable thanks to an efficient feed-forward generator. Listen to GAN-TTS reading this abstract at https://storage.googleapis.com/deepmind-media/research/abstract.wav.

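Motivation and Objectives

Show that an adversarially trained feed-forward network can generate high-fidelity raw speech waveforms.
Propose an ensemble of random-window discriminators (conditional and unconditional) that judge both realism and how well the audio matches the text of the utterance.
Introduce objective metrics for speech generation: Fréchet and kernel distances computed on DeepSpeech features.
Evaluate GAN-TTS against autoregressive baselines and run ablations to validate the architectural choices.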

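Proposed Method

Propose GAN-TTS: a conditional feed-forward generator that produces 24 kHz raw audio from linguistic and pitch features sampled at 200 Hz.
Implement an ensemble of random-window discriminators (RWDs) operating at several window sizes, in both conditional and unconditional variants.
Train with the adversarial loss from the RWD ensemble to improve realism and consistency between text and utterance.
Evaluate with subjective MOS and with objective metrics: FDSD/KDSD and cFDSD/cKDSD, all based on DeepSpeech features.
Use μ-law encoding and dilated convolution blocks with residual connections in the generator to capture long-range dependencies.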
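
The window sampling and μ-law steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `random_window` and `mu_law` are hypothetical helper names, the window sizes in the usage loop are illustrative values chosen as multiples of the frame hop, `mu=255` is the conventional companding constant rather than a value given in this digest, and snapping conditional windows to feature-frame boundaries is assumed so that each audio clip can be paired with its conditioning frames.

```python
import math
import random

SAMPLE_RATE = 24_000   # audio rate stated in the digest
FEATURE_RATE = 200     # conditioning-feature rate stated in the digest
HOP = SAMPLE_RATE // FEATURE_RATE  # audio samples per feature frame


def mu_law(x, mu=255):
    """Compress a sample in [-1, 1] with the standard mu-law companding curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def random_window(audio, features, window, rng, conditional=True):
    """Pick one random window of `window` samples (hypothetical helper).

    For a conditional discriminator the start is snapped to a feature-frame
    boundary (an assumption) so the matching feature slice is returned too.
    An unconditional discriminator may start anywhere and gets no features.
    """
    if conditional:
        if window % HOP != 0:
            raise ValueError("conditional windows must cover whole feature frames")
        n_frames = window // HOP
        frame = rng.randint(0, len(audio) // HOP - n_frames)
        start = frame * HOP
        return audio[start:start + window], features[frame:frame + n_frames]
    start = rng.randint(0, len(audio) - window)
    return audio[start:start + window], None


# Usage: a 2-second synthetic tone with one stand-in feature value per frame.
rng = random.Random(0)
audio = [mu_law(math.sin(2 * math.pi * 220 * t / SAMPLE_RATE))
         for t in range(2 * SAMPLE_RATE)]
features = list(range(len(audio) // HOP))
for window in (240, 480, 960, 1920, 3600):  # illustrative sizes
    clip, cond = random_window(audio, features, window, rng)
    print(window, len(clip), len(cond))
```

With 24 kHz audio and 200 Hz features, each feature frame spans 120 audio samples, which is why the sketch rejects conditional windows that are not a multiple of `HOP`.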

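Experimental Results

Research Questions

RQ1: Can a non-autoregressive feed-forward generator, combined with an ensemble of discriminators, produce speech as natural as that of autoregressive models?
RQ2: Do random-window discriminators spanning several window sizes improve realism and the match between text and utterance?
RQ3: Do the DeepSpeech-based Fréchet and kernel distances correlate reliably with human MOS across TTS models?
RQ4: How do different configurations of conditional and unconditional discriminators affect quality and the evaluation metrics?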

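Key Findings

The best GAN-TTS model reaches MOS = 4.213±0.046, comparable to strong baselines such as WaveNet.
The full multi-window discriminator ensemble outperforms both a single discriminator and a deterministic full discriminator on MOS and on the objective metrics.
Unconditional RWDs improve performance; combining multiple conditional RWDs with unconditional RWDs gives the best results in the ablations.
The conditional and unconditional Fréchet DeepSpeech Distance (FDSD) and Kernel DeepSpeech Distance (KDSD) correlate with MOS, which supports their validity for evaluation.
GAN-TTS matches autoregressive models in naturalness while offering efficient waveform generation that is easier to parallelise.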
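
For reference, Fréchet-style metrics such as FDSD build on the closed-form Fréchet distance between two Gaussians. The expression below is that standard formula, written under the assumption that $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of DeepSpeech features extracted from real and generated audio; these symbols are illustrative and do not come from this digest.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\Big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\Big)
```

The distance is zero when the two feature distributions share the same mean and covariance, and it grows as they drift apart. Kernel-style metrics such as KDSD are instead typically computed as a maximum mean discrepancy estimate between the two feature sets, which avoids the Gaussian assumption.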

This digest was generated by AI and reviewed by human editors.