QUICK REVIEW

[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak | arXiv (Cornell University) | Jun 1, 2023

Music and Audio Processing | Cited by 11

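One-sentence summary

Vocos directly generates Fourier spectral coefficients with an isotropic generator that has no upsampling layers, then applies the inverse STFT for fast, high-quality audio synthesis. It matches state-of-the-art vocoders in quality while running much faster than time-domain GANs.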

ABSTRACT

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefiting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Research motivation and goals

Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
Eliminate learnable upsampling layers by using inverse STFT for waveform reconstruction.
Leverage ConvNeXt blocks to model patterns in the Fourier domain.
Achieve competitive or superior objective and subjective audio quality while substantially increasing inference speed.

Proposed method

Propose Vocos, a GAN-based vocoder whose generator outputs a magnitude term m and a phase term p for each STFT bin and reconstructs the waveform via the inverse STFT (ISTFT).
Use a ConvNeXt-based generator that keeps the same temporal resolution in every layer (isotropic) and forms complex STFT coefficients from a magnitude M and phase components (x, y).
Represent phase as the wrapped angle ϕ = atan2(y, x), which always falls in (-π, π].
Train with hinge adversarial losses, a mel-spectrogram reconstruction loss, and a feature matching loss, using a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD).
Operate without transposed convolutions: the ISTFT performs the upsampling to the audio sample rate, which keeps the architecture isotropic and reduces aliasing artifacts.
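
The output stage described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: it assumes the head maps its raw outputs as M = exp(m), x = cos(p), y = sin(p), which is consistent with the notation above but is not spelled out in this summary. The function names `vocos_head` and `istft`, the Hann window, and the `n_fft` and `hop` values are all hypothetical choices made for the example.

```python
import numpy as np

def vocos_head(m, p):
    """Turn raw head outputs into complex STFT coefficients.

    Assumption (not stated in the summary): m is a log-magnitude and
    p is an unconstrained phase value, both shaped (freq_bins, frames).
    """
    M = np.exp(m)                  # magnitude, positive by construction
    x, y = np.cos(p), np.sin(p)    # phase components; atan2(y, x) is the wrapped phase
    return M * (x + 1j * y)

def istft(S, n_fft, hop):
    """Minimal inverse STFT: per-frame inverse FFT plus windowed overlap-add."""
    window = np.hanning(n_fft + 1)[:-1]                 # periodic Hann window
    frames = np.fft.irfft(S, n=n_fft, axis=0) * window[:, None]
    n_frames = S.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        start = t * hop
        out[start:start + n_fft] += frames[:, t]        # overlap-add the frame
        norm[start:start + n_fft] += window ** 2        # accumulated window energy
    return out / np.maximum(norm, 1e-8)                 # undo the window weighting

# Usage: random head outputs stand in for a trained generator.
rng = np.random.default_rng(0)
n_fft, hop, n_frames = 16, 4, 13
m = rng.normal(size=(n_fft // 2 + 1, n_frames))
p = rng.normal(scale=10.0, size=(n_fft // 2 + 1, n_frames))
S = vocos_head(m, p)
audio = istft(S, n_fft, hop)       # one waveform sample per hop step plus frame tail
```

Each frame contributes `hop` new samples, so all upsampling from frame rate to sample rate happens inside the fixed ISTFT rather than in learned layers. Adding any multiple of 2π to `p` leaves `S` unchanged, which is why the network never has to produce a wrapped phase itself.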

Experimental results

Research questions

RQ1: Can a GAN directly modeling Fourier-domain coefficients reproduce high-quality audio as well as time-domain vocoders?
RQ2: Does avoiding upsampling layers and using ISTFT yield significant speedups without sacrificing perceptual quality?
RQ3: What is the impact of ConvNeXt blocks versus traditional ResBlocks in Fourier-domain vocoding?
RQ4: How does a Fourier-based vocoder compare to EnCodec-like neural codecs in objective and perceptual metrics?
RQ5: Is the phase-wrapping strategy adequate to recover perceptually faithful complex spectrograms?

Key findings

Model	UTMOS (↑)	VISQOL (↑)	PESQ (↑)	V/UV F1 (↑)	Periodicity (↓)
Ground truth	4.058	–	–	–	–
HiFi-GAN	3.669	4.57	3.093	0.9457	0.129
iSTFTNet	3.564	4.56	2.942	0.9372	0.141
BigVGAN	3.749	4.65	3.693	0.9557	0.108
Vocos	3.734	4.66	3.70	0.9582	0.101
w/o ConvNeXt	3.658	4.65	3.528	0.9534	0.109

Vocos achieves the best VISQOL (4.66), PESQ (3.70), and V/UV F1 (0.9582) among the compared models on LibriTTS-derived evaluations, and its UTMOS (3.734) is close to BigVGAN's (3.749).
Vocos has the lowest periodicity error (0.101), below HiFi-GAN (0.129), iSTFTNet (0.141), and BigVGAN (0.108), indicating fewer periodicity artifacts.
ConvNeXt blocks matter: the ablation that replaces them with ResBlocks (w/o ConvNeXt) is slightly worse on every reported metric.
Inference is substantially faster: on GPU, Vocos is about 13x faster than HiFi-GAN and about 70x faster than BigVGAN, because the ISTFT replaces learned upsampling layers.
On MUSDB18 and on out-of-distribution singing voice, Vocos yields higher perceptual quality (VISQOL) than the competing models.
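
As a quick cross-check of the ranking claims, the snippet below picks the best model per metric from the results table above. The numbers are copied from that table; the names `ROWS`, `METRICS`, and `best_per_metric` are hypothetical helpers introduced only for this example, and the ground-truth row and the ablation row are left out because they are not competing systems.

```python
# Values copied from the results table: UTMOS, VISQOL, PESQ, V/UV F1, periodicity.
ROWS = {
    "HiFi-GAN": (3.669, 4.57, 3.093, 0.9457, 0.129),
    "iSTFTNet": (3.564, 4.56, 2.942, 0.9372, 0.141),
    "BigVGAN":  (3.749, 4.65, 3.693, 0.9557, 0.108),
    "Vocos":    (3.734, 4.66, 3.70,  0.9582, 0.101),
}

# Each metric is paired with the selector matching its arrow in the table header:
# max for "higher is better", min for "lower is better".
METRICS = [
    ("UTMOS", max),
    ("VISQOL", max),
    ("PESQ", max),
    ("V/UV F1", max),
    ("Periodicity", min),
]

def best_per_metric(rows):
    """Return {metric name: best model} for the table rows."""
    best = {}
    for column, (name, pick) in enumerate(METRICS):
        best[name] = pick(rows, key=lambda model: rows[model][column])
    return best

for metric, model in best_per_metric(ROWS).items():
    print(f"{metric}: {model}")
```

Some of the margins are small (for example, PESQ 3.70 versus 3.693 and VISQOL 4.66 versus 4.65), so the ranking is best read together with the raw values.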

This review was generated by AI and reviewed by human editors.