QUICK REVIEW

[Paper Review] Universal Speech Enhancement with Score-based Diffusion

Joan Serrà, Santiago Pascual|arXiv (Cornell University)|Jun 7, 2022

Speech and Audio Processing|63 citations

TL;DR

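The paper presents a universal speech enhancement system that tackles 55 different distortions at once, combining a score-based diffusion generative model with a multi-resolution conditioning network. In a subjective test with expert listeners, it significantly outperforms the state of the art.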

ABSTRACT

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.

Motivation & Objective

Motivate a universal speech enhancement solution that handles diverse distortions (reverb, clipping, codec artifacts, limited bandwidth, and others) rather than background noise alone.
Leverage score-based diffusion models to model the clean speech distribution conditioned on noisy input.
Develop training and sampling procedures grounded in score matching to enable effective denoising of speech across contexts.
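
For reference, the denoising score matching objective that such training procedures build on is commonly written as below. This is the generic form from the score-matching literature, not necessarily the exact loss used in the paper. The symbols are notation assumed here: s_theta is the score network, x_0 is clean speech, c is the conditioning derived from the noisy input, sigma is the noise level, and lambda is a weighting function.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{x_0,\, \sigma,\, z \sim \mathcal{N}(0, I)}
  \left[ \lambda(\sigma)
    \left\| s_\theta(x_0 + \sigma z,\, c,\, \sigma) + \frac{z}{\sigma} \right\|_2^2
  \right]
```

The regression target -z/sigma is the score of the Gaussian perturbation kernel N(x_0, sigma^2 I) evaluated at the perturbed point x_0 + sigma z.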

Proposed method

Proposes a diffusion-based framework for speech enhancement using score-based generative modeling.
Utilizes denoising score matching and stochastic differential equation (SDE) formulation to model gradients of the data distribution.
Employs annealed sampling, conditioned on the noisy input, to produce enhanced waveform estimates.
Grounds the approach in prior diffusion and score-matching literature to enable robust denoising in the waveform domain.
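
The sampling idea above can be illustrated with a minimal sketch. It uses a generic annealed Langevin dynamics loop from the score-based modeling literature, not the paper's actual sampler or network. The function names (`toy_score`, `annealed_langevin`), the noise schedule, and the step sizes are all assumptions. In place of a trained score network, the sketch uses the analytic score of a Gaussian centered on a hypothetical conditioning signal `cond`, which stands in for information extracted from the noisy input.

```python
import numpy as np


def toy_score(x, cond, sigma, data_std=0.1):
    """Analytic score of N(cond, data_std**2 + sigma**2).

    Hypothetical stand-in for a trained, conditioned score network.
    """
    return -(x - cond) / (data_std**2 + sigma**2)


def annealed_langevin(cond, sigmas, steps_per_level=20, step_scale=0.1, seed=0):
    """Run Langevin updates over a decreasing sequence of noise levels."""
    rng = np.random.default_rng(seed)
    # Start from pure noise at the largest noise level.
    x = rng.normal(0.0, sigmas[0], size=cond.shape)
    for sigma in sigmas:
        # Step size shrinks with the noise level (assumed schedule).
        eps = step_scale * sigma**2
        for _ in range(steps_per_level):
            z = rng.standard_normal(cond.shape)
            # Langevin update: drift along the score plus injected noise.
            x = x + 0.5 * eps * toy_score(x, cond, sigma) + np.sqrt(eps) * z
    return x


# Toy usage: a 5 Hz sine plays the role of the conditioning signal.
t = np.linspace(0.0, 1.0, 2000)
cond = np.sin(2 * np.pi * 5 * t)
sigmas = np.geomspace(1.0, 0.01, 10)  # high noise to low noise
enhanced = annealed_langevin(cond, sigmas)
print(float(np.mean(np.abs(enhanced - cond))))
```

In a real system the analytic score would be replaced by a learned network that takes the noisy recording as conditioning, and the number of noise levels would be a quality versus speed trade-off.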

Experimental results

Research questions

RQ1: Can score-based diffusion models provide universal applicability for speech enhancement across varied noise types and recording conditions?
RQ2: How can conditioning on noisy speech guide the diffusion process to yield high-quality, artifact-free enhanced speech?
RQ3: What training and sampling strategies yield effective denoising while maintaining perceptual quality?
RQ4: How does the proposed method compare to existing diffusion-based or non-diffusion-based speech enhancement approaches?

Key findings

Introduces a universal speech enhancement method based on score-based diffusion that tackles 55 different distortions simultaneously.
Describes training and sampling procedures leveraging score matching for waveform denoising.
Positions the approach within the broader diffusion framework for audio generation and inference.
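Reports generalization across diverse acoustic conditions: the system significantly outperforms the state of the art in a subjective test by expert listeners and reaches competitive objective scores with just 4-8 diffusion steps.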


This review was created by AI and reviewed by human editors.