[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
Vocos directly generates Fourier spectral coefficients with an isotropic, non-upsampling generator and inverse STFT for fast, high-quality audio synthesis, matching state-of-the-art vocoding while being much faster than time-domain GANs.
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
Motivation & Objective
- Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
- Eliminate learnable upsampling layers by using inverse STFT for waveform reconstruction.
- Leverage ConvNeXt blocks to model patterns in the Fourier domain.
- Achieve competitive or superior objective and subjective audio quality while substantially increasing inference speed.
Proposed method
- Propose Vocos, a GAN-based vocoder that outputs STFT coefficients (m, p) and reconstructs the waveform via inverse STFT.
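- Use a ConvNeXt-based generator that keeps the same temporal resolution in every layer (isotropic) and predicts a log-magnitude m and a phase p, from which M = exp(m), x = cos(p), and y = sin(p) form the complex STFT coefficients M·(x + jy).
- Represent phase implicitly as the wrapped angle ϕ = atan2(y, x), which always falls in (-π, π] regardless of the raw value of p.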
- Train with hinge adversarial losses, a mel-spectrogram reconstruction loss, and feature matching loss across multi-discriminator setups (MPD and MRD).
- Operate without transposed convolutions; upsample via ISTFT, yielding an isotropic architecture and reducing aliasing artifacts.
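The output stage described in the list above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `to_complex_stft`, `wrapped_phase`, `stft`, and `istft`, the Hann window, and the values `n_fft = 64` and `hop = 16` are all assumptions chosen for illustration, and a hand-written overlap-add routine stands in for whatever ISTFT the real model uses. The sketch only shows that a pair (m, p) fully determines the complex coefficients, that adding multiples of 2π to p changes nothing, and that the ISTFT alone brings the frame-rate representation back to sample rate.

```python
import numpy as np

def to_complex_stft(m, p):
    """Map raw outputs (m, p) to complex STFT coefficients M * (x + j*y)."""
    M = np.exp(m)                # magnitude, positive by construction
    x, y = np.cos(p), np.sin(p)  # a point on the unit circle
    return M * (x + 1j * y)

def wrapped_phase(p):
    """Wrapped angle atan2(y, x) implied by an unconstrained phase value p."""
    return np.arctan2(np.sin(p), np.cos(p))

def _hann(n_fft):
    # Periodic Hann window (assumed here; the source does not name a window).
    return np.hanning(n_fft + 1)[:-1]

def stft(signal, n_fft, hop):
    """Plain framed STFT without padding, used only to build test inputs."""
    window = _hann(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop : t * hop + n_fft] * window for t in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)  # shape: (n_fft // 2 + 1, n_frames)

def istft(S, n_fft, hop):
    """Overlap-add inverse STFT with squared-window normalisation."""
    window = _hann(n_fft)
    n_frames = S.shape[1]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(S, n=n_fft, axis=0)  # shape: (n_fft, n_frames)
    for t in range(n_frames):
        start = t * hop
        out[start : start + n_fft] += frames[:, t] * window
        norm[start : start + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# Round trip: pretend a network predicted (m, p) for a known signal.
rng = np.random.default_rng(0)
n_fft, hop = 64, 16                      # illustrative values only
signal = rng.standard_normal(n_fft + hop * 20)
S = stft(signal, n_fft, hop)
m = np.log(np.abs(S) + 1e-12)            # log-magnitude
p = np.angle(S) + 6 * np.pi              # deliberately outside (-pi, pi]
S_hat = to_complex_stft(m, p)
recon = istft(S_hat, n_fft, hop)
```

Because the phase only enters through cos(p) and sin(p), the generator never has to produce a value inside a fixed interval, and the jump from one frame per `hop` samples to one value per sample happens entirely inside `istft`, with no learned upsampling layer.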
Experimental results
Research questions
- RQ1: Can a GAN directly modeling Fourier-domain coefficients reproduce high-quality audio as well as time-domain vocoders?
- RQ2: Does avoiding upsampling layers and using ISTFT yield significant speedups without sacrificing perceptual quality?
- RQ3: What is the impact of ConvNeXt blocks versus traditional ResBlocks in Fourier-domain vocoding?
- RQ4: How does a Fourier-based vocoder compare to EnCodec-like neural codecs in objective and perceptual metrics?
- RQ5: Is the phase-wrapping strategy adequate to recover perceptually faithful complex spectrograms?
Key findings
- Vocos achieves state-of-the-art or near state-of-the-art perceptual metrics (PESQ, VISQOL) on LibriTTS-derived evaluations.
- Vocos mitigates periodicity artifacts more effectively than time-domain GANs like HiFi-GAN, iSTFTNet, and BigVGAN.
- Vocos builds its generator from ConvNeXt blocks; replacing them with ResBlocks slightly degrades performance.
- Inference speed is substantially higher: Vocos is about 13x faster than HiFi-GAN and ~70x faster than BigVGAN on GPU, due to ISTFT-based upsampling.
- On MUSDB18 and out-of-distribution singing voice, Vocos yields higher perceptual quality (VISQOL) than competing models.
This review was created by AI and reviewed by human editors.