QUICK REVIEW

[Paper Review] WaveGrad: Estimating Gradients for Waveform Generation

Nanxin Chen, Yu Zhang|arXiv (Cornell University)|Sep 2, 2020
Music and Audio Processing|59 references|44 citations
TL;DR

WaveGrad is a diffusion/score-based conditional waveform generator that estimates data-density gradients to produce high-fidelity audio non-autoregressively, achieving quality close to autoregressive baselines with as few as six refinement steps and faster inference.

ABSTRACT

This paper introduces WaveGrad, a conditional model for waveform generation which estimates gradients of the data density. The model is built on prior work on score matching and diffusion probabilistic models. It starts from a Gaussian white noise signal and iteratively refines the signal via a gradient-based sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality. We find that it can generate high fidelity audio samples using as few as six iterations. Experiments reveal WaveGrad to generate high fidelity audio, outperforming adversarial non-autoregressive baselines and matching a strong likelihood-based autoregressive baseline using fewer sequential operations. Audio samples are available at https://wavegrad.github.io/.

Motivation & Objective

  • Enable fast, high-quality waveform generation without relying on autoregressive decoding.
  • Leverage gradient-of-data-density (score) learning to model conditional audio distributions.
  • Develop a non-autoregressive generator with a controllable trade-off between inference speed and sample quality.
  • Investigate conditioning schemes (continuous noise level vs discrete step index) for robust inference.
  • Evaluate against autoregressive and non-autoregressive baselines on MOS and objective metrics.

Proposed method

  • Model learns the gradient of the data log-density (score) and uses a Langevin-dynamics–like sampler for inference.
  • Adapts diffusion probabilistic models to conditional waveform generation with mel-spectrogram conditioning.
  • Trains with a weighted denoising score-matching objective conditioned on a continuous noise level ᾱ rather than a discrete step index.
  • Uses a gradient-based sampler that starts from Gaussian noise y_N and progressively denoises it over N steps to the waveform y_0.
  • Architecture is fully convolutional and non-autoregressive, enabling parallel inference.
  • Evaluates continuous-noise-level conditioning vs discrete-index conditioning and analyzes noise schedules and iteration counts.
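The training and sampling steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it follows the standard diffusion-model update rule, the L1 noise-prediction loss and the uniform sampling of the noise level between adjacent schedule values are a reading of the method rather than something this review specifies, and `eps_model`, `mel`, and every function name here are hypothetical placeholders for the real network and its conditioning input.

```python
import numpy as np

def noise_levels(betas):
    """Return alpha_n = 1 - beta_n and the cumulative product alpha_bar_n."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    return alphas, np.cumprod(alphas)

def diffuse(y0, alpha_bar, rng):
    """Forward process: mix clean audio y0 with Gaussian noise at level alpha_bar."""
    eps = rng.standard_normal(y0.shape)
    y_n = np.sqrt(alpha_bar) * y0 + np.sqrt(1.0 - alpha_bar) * eps
    return y_n, eps

def sample_continuous_level(alpha_bars, rng):
    """Assumed scheme: draw sqrt(alpha_bar) uniformly between two adjacent schedule levels."""
    levels = np.concatenate(([1.0], np.sqrt(alpha_bars)))  # l_0 = 1, l_s = sqrt(alpha_bar_s)
    s = rng.integers(1, len(levels))                        # s in {1, ..., S}
    sqrt_alpha_bar = rng.uniform(levels[s], levels[s - 1])  # levels decrease as s grows
    return sqrt_alpha_bar ** 2

def training_loss(eps_model, y0, mel, alpha_bars, rng):
    """One training example: noise y0 at a continuous level, then score the noise estimate."""
    alpha_bar = sample_continuous_level(alpha_bars, rng)
    y_n, eps = diffuse(y0, alpha_bar, rng)
    eps_hat = eps_model(y_n, mel, np.sqrt(alpha_bar))  # hypothetical network call
    return np.mean(np.abs(eps - eps_hat))              # L1 distance (assumed loss)

def sample(eps_model, mel, betas, length, rng):
    """Iteratively refine Gaussian noise y_N into a waveform y_0."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas, alpha_bars = noise_levels(betas)
    y = rng.standard_normal(length)  # y_N: white noise
    for n in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(y, mel, np.sqrt(alpha_bars[n]))
        # Remove the predicted noise component (standard diffusion update).
        y = (y - (1.0 - alphas[n]) / np.sqrt(1.0 - alpha_bars[n]) * eps_hat) / np.sqrt(alphas[n])
        if n > 0:
            # Re-inject a smaller amount of noise on every step except the last.
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            y = y + sigma * rng.standard_normal(length)
    return y
```

Because the model receives the noise level itself rather than a step index, the same `eps_model` could in principle be called with any `betas` array at inference time, for example a six-entry schedule such as `np.linspace(1e-4, 0.5, 6)`. That schedule is purely illustrative and is not the one used in the paper.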

Experimental results

Research questions

  • RQ1: Can WaveGrad generate high-fidelity audio in a non-autoregressive framework while matching autoregressive baselines?
  • RQ2: Does conditioning on a continuous noise level improve flexibility and sample quality compared to conditioning on a discrete iteration index?
  • RQ3: What is the impact of the number of inference iterations on audio quality and speed, and how do different noise schedules affect performance?
  • RQ4: How does WaveGrad compare to established vocoders (autoregressive and non-autoregressive) on subjective MOS and objective metrics?

Key findings

  • WaveGrad matches the autoregressive WaveRNN baseline in MOS while outperforming several non-autoregressive baselines.
  • Six inference iterations with continuous-noise conditioning yield high-fidelity audio (MOS ~4.41) at a real-time factor (RTF) of 0.2 on an NVIDIA V100 GPU.
  • Discrete-index conditioned variants require training separate models per schedule, while continuous-noise conditioning enables a single model to support multiple schedules.
  • Continuous-noise conditioning generalizes better and maintains quality with few iterations compared to discrete conditioning.
  • WaveGrad Base with six iterations achieves comparable MOS to 1,000-iteration discrete models, while significantly speeding up inference (RTF 0.2).
  • Overall, WaveGrad can generate high-fidelity audio with far fewer sequential operations than WaveRNN (which had RTF ~20.1 on the same GPU).
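To put the two RTF figures above in perspective, the short sketch below works through the arithmetic. It assumes the usual definition of the real-time factor as synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The function names and the 10-second clip length are illustrative choices.

```python
def synthesis_seconds(rtf, audio_seconds):
    """Wall-clock time needed to generate a clip, given a real-time factor."""
    return rtf * audio_seconds

def speedup(rtf_slow, rtf_fast):
    """How many times faster the low-RTF system is than the high-RTF one."""
    return rtf_slow / rtf_fast

clip = 10.0  # seconds of audio (illustrative)
wavegrad_time = synthesis_seconds(0.2, clip)   # six-iteration WaveGrad figure quoted above
wavernn_time = synthesis_seconds(20.1, clip)   # WaveRNN figure quoted above
ratio = speedup(20.1, 0.2)
print(f"WaveGrad: {wavegrad_time:.1f} s, WaveRNN: {wavernn_time:.1f} s, ratio: {ratio:.1f}x")
```

Under that definition, a 10-second clip would take about 2 seconds with six-iteration WaveGrad and about 201 seconds with WaveRNN, a gap of roughly two orders of magnitude.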

This review was created by AI and reviewed by human editors.