QUICK REVIEW

[Paper Review] Universal Speech Enhancement with Score-based Diffusion

Joan Serrà, Santiago Pascual|arXiv (Cornell University)|Jun 7, 2022

Speech and Audio Processing | Cited by 63

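One-sentence summary

This paper proposes a universal speech enhancement approach based on score-based diffusion, aiming to improve single-condition and multi-condition speech quality through diffusion probabilistic modeling.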

ABSTRACT

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.

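Motivation and objectives

Propose a general, universal speech enhancement solution that works across diverse acoustic conditions.
Leverage a score-based diffusion model to model the distribution of clean speech conditioned on the noisy input.
Develop training and sampling procedures based on score matching to achieve effective denoising across contexts.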

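Proposed method

Proposes a diffusion-based framework for speech enhancement built on score-based generative models.
Models the gradient of the data distribution using denoising score matching and a stochastic differential equation (SDE) formulation.
Generates enhanced waveform estimates using annealed sampling and conditioning on the noisy input.
Builds on the prior diffusion and score-matching literature to enable robust denoising in the waveform domain.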
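
The two ingredients named above, denoising score matching and annealed sampling conditioned on a noisy input, can be illustrated with a minimal sketch. It is not the paper's system: it assumes a toy one-dimensional problem in which a clean value x is drawn from N(0, 1) and observed as y = x + noise with standard deviation `tau`, so that the conditional score is known in closed form and stands in for a trained network. All names here (`posterior_params`, `make_conditional_score`, `dsm_loss`, `geometric_schedule`, `annealed_langevin`), the noise schedule, and the step-size rule are illustrative assumptions.

```python
import math
import random


def posterior_params(y, tau):
    """Mean and variance of p(x | y) when x ~ N(0, 1) and y = x + N(0, tau^2)."""
    mean = y / (1.0 + tau ** 2)
    var = tau ** 2 / (1.0 + tau ** 2)
    return mean, var


def make_conditional_score(y, tau):
    """Analytic score of the sigma-perturbed posterior (assumed stand-in for a trained network)."""
    mean, var = posterior_params(y, tau)

    def score(x, sigma):
        # Perturbing N(mean, var) with N(0, sigma^2) noise gives N(mean, var + sigma^2).
        return -(x - mean) / (var + sigma ** 2)

    return score


def dsm_loss(score_fn, clean_samples, sigma, rng):
    """Monte Carlo denoising score matching loss at one noise level, weighted by sigma^2."""
    total = 0.0
    for x in clean_samples:
        z = rng.gauss(0.0, 1.0)
        x_noisy = x + sigma * z
        # The Gaussian perturbation kernel has score -(x_noisy - x) / sigma^2 = -z / sigma.
        residual = score_fn(x_noisy, sigma) + z / sigma
        total += sigma ** 2 * residual ** 2
    return total / len(clean_samples)


def geometric_schedule(sigma_max, sigma_min, n_levels):
    """Noise levels decreasing geometrically from sigma_max to sigma_min."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (n_levels - 1))
    return [sigma_max * ratio ** i for i in range(n_levels)]


def annealed_langevin(score_fn, sigmas, rng, steps_per_level=100, eps=0.2):
    """Draw one sample by running Langevin updates at each noise level in turn."""
    x = rng.gauss(0.0, sigmas[0])  # start from broad noise
    for sigma in sigmas:
        alpha = eps * sigma ** 2  # step size shrinks with the noise level (assumed rule)
        noise_scale = math.sqrt(alpha)
        for _ in range(steps_per_level):
            x += 0.5 * alpha * score_fn(x, sigma) + noise_scale * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
y_obs, tau = 1.0, 0.5  # hypothetical noisy observation and noise level
score = make_conditional_score(y_obs, tau)
target_mean, target_var = posterior_params(y_obs, tau)

# Training side: the analytic score should beat a trivial all-zero score on the DSM loss.
clean = [rng.gauss(target_mean, math.sqrt(target_var)) for _ in range(5000)]
loss_analytic = dsm_loss(score, clean, 0.5, rng)
loss_zero = dsm_loss(lambda x, sigma: 0.0, clean, 0.5, rng)

# Sampling side: annealed Langevin dynamics conditioned on y_obs.
sigmas = geometric_schedule(1.0, 0.05, 8)
samples = [annealed_langevin(score, sigmas, rng) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(f"DSM loss: analytic={loss_analytic:.3f}, zero={loss_zero:.3f}")
print(f"samples: mean={sample_mean:.3f} (target {target_mean:.3f}), "
      f"var={sample_var:.3f} (target {target_var:.3f})")
```

Because the step sizes are finite and the smallest noise level is not zero, the sample statistics only approximate the target posterior. In a real enhancement system the closed-form `score` would be replaced by a network trained with a loss of the `dsm_loss` form, taking the noisy recording as conditioning input.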

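Experimental results

Research questions

RQ1: Can score-based diffusion models offer universal applicability to speech enhancement across varied noise types and recording conditions?
RQ2: How can conditioning on the noisy speech be introduced into the diffusion process to obtain high-quality, artifact-free enhanced speech?
RQ3: Which training and sampling strategies yield effective denoising while preserving perceptual quality?
RQ4: How does the proposed method compare with existing diffusion-based and non-diffusion-based speech enhancement methods?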

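Key findings

Proposes a universal speech enhancement method based on score-based diffusion.
Details the training and sampling procedures that use score matching for waveform denoising.
Situates the method within the broader diffusion framework for audio generation and inference.
Discusses the potential benefits for generalization and robustness across diverse acoustic conditions.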

This review was generated by AI and checked by a human editor.