QUICK REVIEW

[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak|arXiv (Cornell University)|Jun 1, 2023
Music and Audio Processing | 11 citations
TL;DR

Vocos directly generates Fourier spectral coefficients with an isotropic, non-upsampling generator and inverse STFT for fast, high-quality audio synthesis, matching state-of-the-art vocoding while being much faster than time-domain GANs.

ABSTRACT

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Motivation & Objective

  • Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
  • Eliminate learnable upsampling layers by using inverse STFT for waveform reconstruction.
  • Leverage ConvNeXt blocks to model patterns in the Fourier domain.
  • Achieve competitive or superior objective and subjective audio quality while substantially increasing inference speed.

Proposed method

  • Propose Vocos, a GAN-based vocoder that outputs STFT coefficients (m, p) and reconstructs the waveform via inverse STFT.
  • Use a ConvNeXt-based generator that keeps the same temporal resolution in every layer (isotropic) and predicts a magnitude M = exp(m) and phase components x = cos(p), y = sin(p), which combine into complex STFT coefficients M · (x + jy).
  • Represent phase as the wrapped angle ϕ = atan2(y, x), which guarantees values in (-π, π].
  • Train with hinge adversarial losses, a mel-spectrogram reconstruction loss, and feature matching loss across multi-discriminator setups (MPD and MRD).
  • Operate without transposed convolutions; upsample via ISTFT, yielding an isotropic architecture and reducing aliasing artifacts.
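
The output stage described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function names `spectral_head` and `istft`, the random stand-in for the network outputs, and the sizes `n_fft = 1024` and `hop = 256` are assumptions made for the example. The inverse STFT here is a plain Hann-window overlap-add that trims half a window from each end and assumes `hop <= n_fft // 2`.

```python
import numpy as np

def spectral_head(m, p):
    """Map raw outputs (m, p) to complex STFT coefficients and wrapped phase."""
    M = np.exp(m)                # magnitude, positive by construction
    x, y = np.cos(p), np.sin(p)  # phase as a point on the unit circle
    phi = np.arctan2(y, x)       # wrapped phase angle
    return M * (x + 1j * y), phi

def istft(spec, n_fft, hop):
    """Overlap-add inverse STFT; spec has shape (n_fft // 2 + 1, n_frames)."""
    n_frames = spec.shape[1]
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    # Inverse real FFT of every frame, then apply the synthesis window.
    frames = np.fft.irfft(spec, n=n_fft, axis=0) * window[:, None]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        start = t * hop
        out[start:start + n_fft] += frames[:, t]
        norm[start:start + n_fft] += window ** 2
    # Drop the half-window edges where the overlap is incomplete.
    pad = n_fft // 2
    return out[pad:-pad] / norm[pad:-pad]

# Random values stand in for the generator's per-frame outputs.
rng = np.random.default_rng(0)
n_fft, hop, n_frames = 1024, 256, 10
m = rng.normal(size=(n_fft // 2 + 1, n_frames))
p = rng.normal(scale=5.0, size=(n_fft // 2 + 1, n_frames))

spec, phi = spectral_head(m, p)
audio = istft(spec, n_fft, hop)
print(spec.shape, audio.shape)
```

Every frame keeps the same resolution until the final step; the only upsampling is the fixed overlap-add, which turns `n_frames` frames into `hop * (n_frames - 1)` samples without any learned transposed convolution.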

Experimental results

Research questions

  • RQ1: Can a GAN directly modeling Fourier-domain coefficients reproduce high-quality audio as well as time-domain vocoders?
  • RQ2: Does avoiding upsampling layers and using ISTFT yield significant speedups without sacrificing perceptual quality?
  • RQ3: What is the impact of ConvNeXt blocks versus traditional ResBlocks in Fourier-domain vocoding?
  • RQ4: How does a Fourier-based vocoder compare to EnCodec-like neural codecs in objective and perceptual metrics?
  • RQ5: Is the phase-wrapping strategy adequate to recover perceptually faithful complex spectrograms?

Key findings

  • Vocos achieves state-of-the-art or near state-of-the-art perceptual metrics (PESQ, ViSQOL) on LibriTTS-derived evaluations.
  • Vocos mitigates periodicity artifacts more effectively than time-domain GANs like HiFi-GAN, iSTFTNet, and BigVGAN.
  • The ConvNeXt blocks contribute to quality: replacing them with ResBlocks slightly degrades performance.
  • Inference is substantially faster: Vocos is about 13x faster than HiFi-GAN and about 70x faster than BigVGAN on GPU, because upsampling is done by the ISTFT rather than by learned layers.
  • On MUSDB18 and out-of-distribution singing voice, Vocos yields higher perceptual quality (ViSQOL) than competing models.

This review was created by AI and reviewed by human editors.