QUICK REVIEW

[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak|arXiv (Cornell University)|Jun 1, 2023

Music and Audio Processing|11 citations

TL;DR

Vocos directly generates Fourier spectral coefficients with an isotropic, non-upsampling generator and inverse STFT for fast, high-quality audio synthesis, matching state-of-the-art vocoding while being much faster than time-domain GANs.

ABSTRACT

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Motivation & Objective

Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
Eliminate learnable upsampling layers by using inverse STFT for waveform reconstruction.
Leverage ConvNeXt blocks to model patterns in the Fourier domain.
Achieve competitive or superior objective and subjective audio quality while substantially increasing inference speed.

Proposed method

Propose Vocos, a GAN-based vocoder whose generator predicts a log-magnitude m and a phase p for each STFT bin and reconstructs the waveform via inverse STFT (ISTFT).
Use a ConvNeXt-based generator that keeps the same temporal resolution in every layer (isotropic) and forms each complex STFT coefficient as M · (x + jy), with magnitude M = exp(m), x = cos(p), and y = sin(p).
Represent the phase as the wrapped angle ϕ = atan2(y, x), which always falls in (-π, π] regardless of the raw value of p.
Train with hinge adversarial losses, a mel-spectrogram reconstruction loss, and a feature matching loss, using two discriminator families: a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD).
Operate without transposed convolutions; upsample via ISTFT, yielding an isotropic architecture and reducing aliasing artifacts.
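
The output stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, a naive DFT instead of an FFT, and full (two-sided) spectra for simplicity. The function names head, hann, stft, istft, and roundtrip are hypothetical, and the values of m and p are faked from a known STFT rather than predicted by a network. The sketch shows two things: the pair (m, p) always yields a valid complex coefficient with a wrapped phase, and the ISTFT alone turns a short sequence of frames into a longer waveform with no learned upsampling.

```python
import cmath
import math


def head(m, p):
    """Turn one (log-magnitude, raw phase) pair into a complex STFT coefficient."""
    M = math.exp(m)                    # exponentiation keeps the magnitude positive
    x, y = math.cos(p), math.sin(p)    # a point on the unit circle
    phi = math.atan2(y, x)             # wrapped phase (principal value)
    return M * complex(x, y), phi


def hann(n_fft):
    """Periodic Hann window."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Naive analysis STFT; here it only stands in for the generator's output."""
    win = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * win[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft)
        ])
    return frames


def istft(frames, n_fft, hop):
    """Inverse DFT per frame, then windowed overlap-add with normalization."""
    win = hann(n_fft)
    length = hop * (len(frames) - 1) + n_fft
    out = [0.0] * length
    norm = [0.0] * length
    for t, spec in enumerate(frames):
        for n in range(n_fft):
            s = sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_fft) for k in range(n_fft))
            out[t * hop + n] += (s.real / n_fft) * win[n]
            norm[t * hop + n] += win[n] ** 2
    # Samples where the summed window energy is zero cannot be recovered.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def roundtrip(signal, n_fft=8, hop=2):
    """Fake (m, p) from a known STFT, pass them through head(), rebuild audio."""
    rebuilt = []
    for spec in stft(signal, n_fft, hop):
        frame = []
        for c in spec:
            m = math.log(abs(c) + 1e-9)        # log-magnitude with a small floor
            p = cmath.phase(c) + 4 * math.pi   # deliberately unwrapped phase
            coeff, _ = head(m, p)
            frame.append(coeff)
        rebuilt.append(frame)
    return istft(rebuilt, n_fft, hop)
```

With a 32-sample input, n_fft=8, and hop=2, the sketch produces 13 frames and the ISTFT returns 32 samples again, so each frame accounts for hop new samples. The first sample is lost because the periodic Hann window is zero there; practical STFT implementations usually pad the signal to avoid this.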

Experimental results

Research questions

RQ1: Can a GAN directly modeling Fourier-domain coefficients reproduce high-quality audio as well as time-domain vocoders?
RQ2: Does avoiding upsampling layers and using ISTFT yield significant speedups without sacrificing perceptual quality?
RQ3: What is the impact of ConvNeXt blocks versus traditional ResBlocks in Fourier-domain vocoding?
RQ4: How does a Fourier-based vocoder compare to EnCodec-like neural codecs in objective and perceptual metrics?
RQ5: Is the phase-wrapping strategy adequate to recover perceptually faithful complex spectrograms?

Key findings

Vocos achieves state-of-the-art or near state-of-the-art perceptual metrics (PESQ, ViSQOL) on LibriTTS-derived evaluations.
Vocos mitigates periodicity artifacts more effectively than time-domain GANs like HiFi-GAN, iSTFTNet, and BigVGAN.
ConvNeXt blocks contribute to the result: replacing them with ResBlocks slightly degrades performance.
Inference speed is substantially higher: Vocos is about 13x faster than HiFi-GAN and ~70x faster than BigVGAN on GPU, due to ISTFT-based upsampling.
On MUSDB18 and out-of-distribution singing voice, Vocos yields higher perceptual quality (ViSQOL) than competing models.

This review was created by AI and reviewed by human editors.