QUICK REVIEW

[Paper Digest] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim | arXiv (Cornell University) | Oct 12, 2020

Speech and Audio Processing | References: 23 | Cited by: 739

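ONE-SENTENCE SUMMARY

HiFi-GAN introduces a GAN-based vocoder with multi-period and multi-scale discriminators and a multi-receptive-field generator that delivers high-fidelity speech efficiently, surpassing autoregressive and flow-based models in both MOS and speed.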

ABSTRACT

Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

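MOTIVATION AND OBJECTIVES

Motivate neural vocoders that balance speech quality against synthesis speed.
Develop a generator-discriminator architecture that captures the periodic patterns of audio.
Improve training stability and perceptual quality through auxiliary losses and a multi-discriminator design.
Demonstrate generalization to unseen speakers and to end-to-end text-to-speech pipelines.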

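PROPOSED METHOD

Propose HiFi-GAN, which consists of one generator and two discriminators (multi-scale and multi-period).
Introduce multi-receptive field fusion (MRF) in the generator to capture patterns of different lengths.
Use a multi-period discriminator (MPD) with periods [2, 3, 5, 7, 11] to model periodic components.
Use a multi-scale discriminator (MSD) to evaluate audio at several time scales.
Train with a combination of an adversarial loss (LSGAN), a mel-spectrogram loss (L1), and a feature matching loss (L_FM).
Provide three generator configurations (V1, V2, V3) that trade off quality against efficiency.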
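
The period-based view used by the MPD can be illustrated with a small sketch. It assumes, as a simplification, that each sub-discriminator first folds the 1D waveform into a 2D array whose width equals its period, so that every column holds samples spaced one period apart. The function name `reshape_for_period`, the reflect padding, and the toy input are illustrative choices for this sketch, not details taken from the digest.

```python
import numpy as np

PERIODS = [2, 3, 5, 7, 11]  # periods listed in the method summary


def reshape_for_period(wave: np.ndarray, period: int) -> np.ndarray:
    """Fold a 1D waveform of length T into shape (ceil(T / period), period)."""
    pad = (-len(wave)) % period  # samples needed to reach a multiple of `period`
    if pad:
        # Reflect padding at the end is an assumption made for this sketch.
        wave = np.pad(wave, (0, pad), mode="reflect")
    return wave.reshape(-1, period)


# Toy "waveform": the sample values equal their indices, so the folding is easy to read.
wave = np.arange(10, dtype=np.float32)
folded = {p: reshape_for_period(wave, p) for p in PERIODS}

for p, view in folded.items():
    # Column 0 of each view contains samples 0, p, 2p, ...
    print(p, view.shape, view[:, 0].tolist())
```

Because the periods are prime, the five folded views overlap as little as possible, so each sub-discriminator inspects a different set of equally spaced samples.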

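EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a GAN-based vocoder reach perceptual quality comparable to autoregressive and flow-based models?
RQ2: Does explicitly modeling periodic patterns with the MPD improve speech synthesis quality?
RQ3: How do the multi-scale and multi-period discriminators affect training stability and sample fidelity?
RQ4: Does HiFi-GAN generalize to unseen speakers and to end-to-end TTS pipelines?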

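Key Findings

On LJSpeech, the HiFi-GAN variants outperform WaveNet (MoL), WaveGlow, and MelGAN in MOS.
V1 (13.92M parameters) reaches MOS 4.36 (CI ±0.07) and synthesizes at about 3.7 MHz (3.7 million samples per second) on GPU, which is ×167.86 faster than real time.
V2 uses only 0.92M parameters and reaches MOS 4.23 (CI ±0.07) with a much higher GPU speed (×764.80).
V3, the small-footprint variant, is the fastest (CPU: ×13.44 faster than real time; GPU: ×1186.80) with MOS 4.05 (CI ±0.08), which makes it suitable for on-device use.
The ablation study shows that the MPD is critical (MOS drops to 2.28 without it), that the MSD improves quality, and that the mel-spectrogram loss stabilizes training.
End-to-end fine-tuning with Tacotron2 improves the end-to-end MOS of the HiFi-GAN variants.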
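
The speed figures above can be sanity-checked with simple arithmetic: a real-time factor is the number of generated samples per second divided by the output sampling rate (22.05 kHz in the abstract). The helper names below are hypothetical and the snippet is only a sketch of that relationship, not part of the paper's evaluation code.

```python
SAMPLE_RATE_HZ = 22_050  # output sampling rate stated in the abstract


def realtime_factor(samples_per_second: float,
                    sample_rate_hz: float = SAMPLE_RATE_HZ) -> float:
    """How many seconds of audio are produced per second of compute."""
    return samples_per_second / sample_rate_hz


def seconds_to_synthesize(audio_seconds: float, factor: float) -> float:
    """Compute time needed for a clip, given a real-time factor."""
    return audio_seconds / factor


# V1 on GPU: roughly 3.7 million samples per second (rounded figure from the findings).
v1_gpu = realtime_factor(3.7e6)
print(f"V1 GPU real-time factor: about {v1_gpu:.1f}")

# Time to synthesize a 10-second clip at the reported factors.
for name, factor in [("V1 GPU", 167.86), ("V2 GPU", 764.80),
                     ("V3 GPU", 1186.80), ("V3 CPU", 13.44)]:
    print(f"{name}: {seconds_to_synthesize(10.0, factor) * 1000:.1f} ms")
```

The rounded 3.7 MHz figure gives a factor just under 168, in line with the reported ×167.86; the small gap comes from rounding the throughput.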

This digest was generated by AI and reviewed by a human editor.