QUICK REVIEW

[Paper Review] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Jungil Kong, Jaehyeon Kim | arXiv (Cornell University) | Oct 12, 2020

Speech and Audio Processing | 23 references | 739 citations

TL;DR

HiFi-GAN introduces a GAN-based vocoder with multi-period and multi-scale discriminators and a multi-receptive-field generator to deliver high-fidelity speech efficiently, outperforming autoregressive and flow-based models in MOS and speed.

ABSTRACT

Several recent work on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.

Motivation & Objective

Motivate neural vocoders that balance speech quality with synthesis speed.
Develop a generator-discriminator architecture that captures periodic patterns in audio.
Improve training stability and perceptual quality via auxiliary losses and multi-discriminator design.
Demonstrate generalization to unseen speakers and end-to-end text-to-speech pipelines.

Proposed method

Propose HiFi-GAN with one generator and two discriminators (multi-scale and multi-period).
Introduce Multi-Receptive Field Fusion (MRF) in the generator to capture patterns of varying lengths.
Employ a Multi-Period Discriminator (MPD) with periods [2,3,5,7,11] to model periodic components.
Use a Multi-Scale Discriminator (MSD) to evaluate audio at multiple temporal scales.
Train with a combination of a least-squares adversarial loss (LSGAN), an L1 mel-spectrogram loss, and a feature matching loss (L_FM).
Provide three generator configurations (V1, V2, V3) to trade off quality and efficiency.
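
The period-based view used by MPD and the combined training objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the reflect padding is an assumption about how a waveform is extended to a multiple of the period, and the default loss weights (2 for feature matching, 45 for the mel-spectrogram term) are not stated in this review and should be treated as assumed values.

```python
import numpy as np


def reshape_by_period(audio, period):
    """Fold a 1D waveform into a 2D array of shape (ceil(T / period), period).

    Each column then holds samples spaced `period` steps apart, which is the
    view a period-based discriminator would inspect.
    """
    audio = np.asarray(audio, dtype=np.float64)
    remainder = len(audio) % period
    if remainder:
        # Assumption: reflect-pad the tail so the length divides evenly.
        audio = np.pad(audio, (0, period - remainder), mode="reflect")
    return audio.reshape(-1, period)


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares GAN loss: push real scores toward 1 and fake scores toward 0."""
    real_scores = np.asarray(real_scores, dtype=np.float64)
    fake_scores = np.asarray(fake_scores, dtype=np.float64)
    return float(np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2))


def lsgan_generator_loss(fake_scores):
    """Least-squares GAN loss for the generator: push fake scores toward 1."""
    fake_scores = np.asarray(fake_scores, dtype=np.float64)
    return float(np.mean((fake_scores - 1.0) ** 2))


def l1_distance(a, b):
    """Mean absolute difference, usable for the mel and feature matching terms."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean(np.abs(a - b)))


def generator_objective(adv, fm, mel, lambda_fm=2.0, lambda_mel=45.0):
    """Weighted sum of the three generator terms (weights are assumed defaults)."""
    return adv + lambda_fm * fm + lambda_mel * mel


if __name__ == "__main__":
    # A toy sinusoid with an 11-sample cycle, folded at each MPD period.
    wave = np.sin(2 * np.pi * np.arange(22) / 11.0)
    for period in (2, 3, 5, 7, 11):
        print(period, reshape_by_period(wave, period).shape)
```

Because the periods [2,3,5,7,11] are prime, the folded views overlap as little as possible, so each sub-discriminator looks at a different set of sample spacings.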

Experimental results

Research questions

RQ1: Can a GAN-based vocoder achieve high perceptual quality comparable to autoregressive and flow-based models?
RQ2: Does explicitly modeling periodic patterns via MPD improve speech synthesis quality?
RQ3: How do multi-scale and multi-period discriminators affect training stability and sample fidelity?
RQ4: Can HiFi-GAN generalize to unseen speakers and end-to-end TTS pipelines?

Key findings

HiFi-GAN variants outperform WaveNet (MoL), WaveGlow, and MelGAN in MOS on LJSpeech.
V1 achieves MOS 4.36 (CI 0.07) with 13.92M parameters and synthesizes 167.86 times faster than real time on GPU, which is about 3.7 million samples per second (3.7 MHz) at 22.05 kHz.
V2 uses 0.92M parameters and achieves MOS 4.23 (CI 0.07) with substantial speed gains (×764.80 on GPU).
V3 is the fastest variant (CPU: ×13.44 real time; GPU: ×1186.80) with MOS 4.05 (CI 0.08), suitable for on-device use.
Ablation shows that MPD is critical (without it, MOS drops to 2.28), that MSD contributes to quality, and that the mel-spectrogram loss stabilizes training.
Fine-tuning with Tacotron2 improves the MOS of the HiFi-GAN variants in the end-to-end setting.
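
The speed multipliers above are relative to real time at the 22.05 kHz sample rate given in the abstract. As a rough sketch of that arithmetic (the function and variable names are hypothetical, and only figures quoted in this review are used), a factor of ×167.86 corresponds to roughly 3,701 kHz of generated samples per second:

```python
SAMPLE_RATE_HZ = 22_050  # 22.05 kHz, as stated in the abstract

# Real-time factors quoted in the key findings.
REALTIME_FACTORS = {
    "V1 (GPU)": 167.86,
    "V2 (GPU)": 764.80,
    "V3 (GPU)": 1186.80,
    "V3 (CPU)": 13.44,
}


def throughput_khz(realtime_factor, sample_rate_hz=SAMPLE_RATE_HZ):
    """Samples generated per second of wall-clock time, in kHz."""
    return realtime_factor * sample_rate_hz / 1000.0


def seconds_to_synthesize(audio_seconds, realtime_factor):
    """Wall-clock time needed to generate `audio_seconds` of audio."""
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    for name, factor in REALTIME_FACTORS.items():
        print(f"{name}: {throughput_khz(factor):,.1f} kHz, "
              f"{seconds_to_synthesize(10.0, factor) * 1000:.1f} ms per 10 s of audio")
```

Read this way, the CPU figure for V3 means that ten seconds of speech take under one second to generate.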

This review was created by AI and reviewed by human editors.