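[Paper Explained] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
Vocos directly generates Fourier spectral coefficients with an isotropic generator that has no upsampling layers, then applies the inverse STFT for fast, high-quality audio synthesis. It matches state-of-the-art vocoders in quality while running much faster than time-domain GANs.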
Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.
Motivation and Objectives
- Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
- Eliminate learnable upsampling layers by using inverse STFT for waveform reconstruction.
- Leverage ConvNeXt blocks to model spatial patterns in the Fourier domain.
- Achieve competitive or superior objective and subjective audio quality while substantially increasing inference speed.
Proposed Method
- Propose Vocos, a GAN-based vocoder whose generator predicts a magnitude term m and a phase term p for every STFT bin and reconstructs the waveform via inverse STFT.
- Use a ConvNeXt-based generator that keeps the same temporal resolution in every layer (isotropic). Its outputs are turned into a magnitude M and unit-circle coordinates (x, y), which together form the complex STFT coefficient M · (x + jy).
- Express the phase as the angle ϕ = atan2(y, x), which is wrapped into (-π, π] for any real-valued p.
- Train with hinge adversarial losses, a mel-spectrogram reconstruction loss, and a feature matching loss, using multi-period (MPD) and multi-resolution (MRD) discriminators.
- Operate without transposed convolutions; upsample via ISTFT, yielding an isotropic architecture and reducing aliasing artifacts.
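The phase parameterization above can be illustrated with a minimal sketch using only the Python standard library. It assumes the magnitude is obtained as M = exp(m) and the coordinates as x = cos(p), y = sin(p); this summary does not spell out those activations, so treat them as assumptions. The function names are hypothetical and are not taken from the Vocos code base.

```python
import cmath
import math


def to_complex_coefficient(m: float, p: float) -> complex:
    """Map raw outputs (m, p) to one complex STFT coefficient M * (x + jy)."""
    magnitude = math.exp(m)            # assumed activation: M = exp(m), always positive
    x, y = math.cos(p), math.sin(p)    # assumed: unit-circle coordinates of the phase
    return magnitude * complex(x, y)


def wrapped_phase(p: float) -> float:
    """Return phi = atan2(sin p, cos p), which lies in [-pi, pi] for any real p."""
    return math.atan2(math.sin(p), math.cos(p))


# Raw phase outputs that differ by 2*pi describe the same coefficient,
# so the network is free to output any real value for p.
a = to_complex_coefficient(0.5, 1.0)
b = to_complex_coefficient(0.5, 1.0 + 2 * math.pi)
print(abs(a - b) < 1e-9)     # the two coefficients coincide
print(wrapped_phase(7.0))    # 7.0 wrapped back into the principal range
print(cmath.phase(a))        # phase recovered from the complex coefficient
```

In a full vocoder, a matrix of such coefficients (frequency bins by frames) would be passed to an inverse STFT routine, which performs the upsampling from frame rate to sample rate without any learned layers.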
Experimental Results
Research Questions
- RQ1: Can a GAN that directly models Fourier-domain coefficients reproduce audio as well as time-domain vocoders?
- RQ2: Does avoiding upsampling layers and using ISTFT yield significant speedups without sacrificing perceptual quality?
- RQ3: What is the impact of ConvNeXt blocks versus traditional ResBlocks in Fourier-domain vocoding?
- RQ4: How does a Fourier-based vocoder compare to EnCodec-like neural codecs in objective and perceptual metrics?
- RQ5: Is the phase-wrapping strategy adequate to recover perceptually faithful complex spectrograms?
Key Findings
| Model | UTMOS (↑) | VISQOL (↑) | PESQ (↑) | V/UV F1 (↑) | Periodicity (↓) |
|---|---|---|---|---|---|
| Ground truth | 4.058 | – | – | – | – |
| HiFi-GAN | 3.669 | 4.57 | 3.093 | 0.9457 | 0.129 |
| iSTFTNet | 3.564 | 4.56 | 2.942 | 0.9372 | 0.141 |
| BigVGAN | 3.749 | 4.65 | 3.693 | 0.9557 | 0.108 |
| Vocos | 3.734 | 4.66 | 3.70 | 0.9582 | 0.101 |
| w/o ConvNeXt | 3.658 | 4.65 | 3.528 | 0.9534 | 0.109 |
- On LibriTTS-derived evaluations, Vocos has the highest VISQOL (4.66) and PESQ (3.70) in the table, and its UTMOS (3.734) is a close second to BigVGAN (3.749).
- Vocos has the lowest periodicity error (0.101), ahead of BigVGAN (0.108), HiFi-GAN (0.129), and iSTFTNet (0.141), which suggests it mitigates periodicity artifacts more effectively than these baselines.
- ConvNeXt blocks contribute to quality: replacing them with ResBlocks lowers UTMOS from 3.734 to 3.658 and PESQ from 3.70 to 3.528.
- Inference is substantially faster: on GPU, Vocos is about 13x faster than HiFi-GAN and about 70x faster than BigVGAN, because upsampling is handled by the ISTFT.
- On MUSDB18 and out-of-distribution singing voice, Vocos yields higher perceptual quality (VISQOL) than competing models.
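As a quick cross-check of the findings against the table above, the following sketch picks the best model per metric. The numbers are copied from the table (the ground-truth row is omitted because it has only a UTMOS score); the dictionary layout and function name are illustrative choices, not part of the paper.

```python
# Objective metrics copied from the table above.
# Column order: UTMOS, VISQOL, PESQ, V/UV F1, periodicity error.
RESULTS = {
    "HiFi-GAN": (3.669, 4.57, 3.093, 0.9457, 0.129),
    "iSTFTNet": (3.564, 4.56, 2.942, 0.9372, 0.141),
    "BigVGAN": (3.749, 4.65, 3.693, 0.9557, 0.108),
    "Vocos": (3.734, 4.66, 3.70, 0.9582, 0.101),
    "w/o ConvNeXt": (3.658, 4.65, 3.528, 0.9534, 0.109),
}

# (metric name, True if higher is better)
METRICS = [
    ("UTMOS", True),
    ("VISQOL", True),
    ("PESQ", True),
    ("V/UV F1", True),
    ("Periodicity", False),
]


def best_per_metric(results, metrics):
    """Return {metric name: best model}, respecting each metric's direction."""
    best = {}
    for column, (name, higher_is_better) in enumerate(metrics):
        pick = max if higher_is_better else min
        best[name] = pick(results, key=lambda model: results[model][column])
    return best


best = best_per_metric(RESULTS, METRICS)
for metric, model in best.items():
    print(f"{metric}: {model}")
```

Vocos comes out on top for every metric except UTMOS, where BigVGAN leads by a small margin.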
This explainer was generated by AI and reviewed by human editors.