QUICK REVIEW

[Paper Review] Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation

Shahar Lutati, Eliya Nachmani|arXiv (Cornell University)|Jan 25, 2023

Speech and Audio Processing | Citations: 12

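One-Line Summary

Summary: This study shows that applying a pretrained diffusion-based vocoder to the output of a deterministic source separation model improves multi-speaker speech separation and achieves state-of-the-art results. It also shows that linearly combining the deterministic and generative outputs in the spectral domain exceeds the deterministic upper bound in some cases.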

ABSTRACT

The problem of speech separation, also known as the cocktail party problem, refers to the task of isolating a single speech signal from a mixture of speech signals. Previous work on source separation derived an upper bound for the source separation task in the domain of human speech. This bound is derived for deterministic models. Recent advancements in generative models challenge this bound. We show how the upper bound can be generalized to the case of random generative models. Applying a diffusion model Vocoder that was pretrained to model single-speaker voices on the output of a deterministic separation model leads to state-of-the-art separation results. It is shown that this requires one to combine the output of the separation model with that of the diffusion model. In our method, a linear combination is performed, in the frequency domain, using weights that are inferred by a learned model. We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks. In particular, for two speakers, our method is able to surpass what was previously considered the upper performance bound.

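Motivation and Objectives

Motivate and formalize the upper bound on speech separation when non-deterministic generative models are used.
Show that a pretrained diffusion vocoder improves separation when combined with the deterministic output.
Derive theoretical bounds on mutual information and SDR for the hybrid deterministic-generative pipeline.
Propose a learnable spectral-domain fusion that combines the deterministic and generative estimates.
Empirically validate the improvements for 2, 3, 5, 10, and 20 speakers on LibriSpeech and WSJ0.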

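Proposed Method

Apply a backbone separator B to the speech mixture to obtain the estimates ϕvd for each source.
Pass each ϕvd through the pretrained diffusion vocoder GM to obtain ϕvg for each source.
Convert both ϕvd and ϕvg to mel-spectrograms, then concatenate magnitude and phase as the input to the learned alignment network F.
Compute complex mixing weights [β5, β7] via F and form the final spectral estimate V = β5 ⋅ Vd + β7 ⋅ Vg, then recover the time-domain signal with an inverse STFT.
Train only the alignment network F, using SI-SDR as the objective with Hungarian assignment to match estimates to sources.
Use DiffWave, pretrained on single-speaker data (LibriMix/WSJ0), as GM; B is based on publicly available models (e.g., Gated-LSTM or SepFormer).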
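The fusion step and the training criterion in the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `si_sdr`, `best_assignment`, and `fuse` are hypothetical helper names, `w_d` and `w_g` stand in for the two complex weight outputs of F, spectra are flattened to lists of complex bins, and an exhaustive permutation search is swapped in for the Hungarian algorithm to keep the example short.

```python
import itertools
import math

EPS = 1e-12  # guards against division by zero and log of zero


def si_sdr(estimate, target):
    """Scale-invariant SDR (dB) between two equal-length real signals."""
    dot = sum(e * t for e, t in zip(estimate, target))
    energy = sum(t * t for t in target)
    scale = dot / (energy + EPS)
    projection = [scale * t for t in target]          # part of estimate aligned with target
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10((signal_power + EPS) / (noise_power + EPS))


def best_assignment(estimates, targets):
    """Pick the estimate-to-target matching with the highest mean SI-SDR.

    Assumption: brute force over permutations, used here in place of the
    Hungarian algorithm; perm[i] is the estimate index assigned to target i.
    """
    n = len(targets)
    best_perm, best_score = None, -math.inf
    for perm in itertools.permutations(range(n)):
        score = sum(si_sdr(estimates[p], targets[i]) for i, p in enumerate(perm)) / n
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


def fuse(spec_d, spec_g, w_d, w_g):
    """Per-bin complex linear combination V = w_d * V_d + w_g * V_g."""
    return [wd * vd + wg * vg for vd, vg, wd, wg in zip(spec_d, spec_g, w_d, w_g)]


# Toy data: two sources whose estimates arrive in swapped order.
targets = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
estimates = [[0.0, 2.0, 0.1, -2.0], [1.0, 0.1, -1.0, 0.0]]
perm, mean_score = best_assignment(estimates, targets)
print("assignment:", perm, "mean SI-SDR (dB):", round(mean_score, 2))

# Toy spectra: equal weighting of the deterministic and generative bins.
fused = fuse([1 + 1j, 2 + 0j], [0j, 1j], [0.5, 1.0], [0.5, 1.0])
print("fused bins:", fused)
```

In a real pipeline the weights would be predicted per time-frequency bin by F and the fused spectrum passed to an inverse STFT; both are omitted here because the review does not specify their configuration.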

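Experimental Results

Research Questions

RQ1: Can a pretrained diffusion model act as a post-processing prior for deterministic source separation and improve separation?
RQ2: What theoretical bound governs the maximum improvement achievable when deterministic and generative estimates are combined?
RQ3: Does learning a weighted fusion in the spectral domain outperform heuristic phase-alignment methods?
RQ4: How does the approach's performance change when scaling from 2 to 20 speakers on standard benchmarks?
RQ5: Can a non-deterministic generative component exceed the classical upper bound for deterministic models?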

Key Findings

Method	WSJ0 2Mix	WSJ0 3Mix	LibriSpeech 2Mix	LibriSpeech 5Mix	LibriSpeech 10Mix	LibriSpeech 20Mix
Classical Upper Bound (Lutati et al.)	23.1	21.2	23.1	14.5	12.0	8.0
Generative Upper Bound (ours)	26.1	24.2	26.1	17.5	15.0	11.0
DiffSep [27]	14.3	-	-	-	-	-
SepIt [22]	22.4	20.1	-	13.7	8.2	-
SepFormer [30]	22.3	19.8	20.6	-	-	-
SepFormer + HiFiGAN [13]	22.3	20.0	-	-	-	-
SepFormer + DiffWave -F (ablation)	22.6	20.3	20.8	-	-	-
SepFormer + DiffWave (ours)	23.9	20.9	21.5	-	-	-
Gated LSTM [24]	20.1	16.9	-	12.7	7.7	4.3
Gated LSTM + DiffWave -F (ablation)	- ∗	- ∗	-	13.0	8.1	4.5
Gated LSTM + DiffWave (ours)	- ∗	- ∗	-	14.2	9.0	5.2
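The table can be read more easily by computing, for each benchmark, how much the full method adds over its backbone and whether it clears the classical bound. The sketch below only re-uses numbers copied from the rows above; the variable names are illustrative, and it assumes all values are in dB on the same metric.

```python
# (backbone score, backbone + DiffWave with F) in dB, copied from the table.
results = {
    ("SepFormer", "WSJ0 2Mix"): (22.3, 23.9),
    ("SepFormer", "WSJ0 3Mix"): (19.8, 20.9),
    ("SepFormer", "LibriSpeech 2Mix"): (20.6, 21.5),
    ("Gated LSTM", "LibriSpeech 5Mix"): (12.7, 14.2),
    ("Gated LSTM", "LibriSpeech 10Mix"): (7.7, 9.0),
    ("Gated LSTM", "LibriSpeech 20Mix"): (4.3, 5.2),
}

# Classical upper bound row (Lutati et al.), copied from the table.
classical_bound = {
    "WSJ0 2Mix": 23.1,
    "WSJ0 3Mix": 21.2,
    "LibriSpeech 2Mix": 23.1,
    "LibriSpeech 5Mix": 14.5,
    "LibriSpeech 10Mix": 12.0,
    "LibriSpeech 20Mix": 8.0,
}

# Gain of the full method over its backbone, rounded to the table's precision.
gains = {key: round(ours - base, 1) for key, (base, ours) in results.items()}

# Benchmarks where the full method scores above the classical bound.
above_bound = [
    bench for (_, bench), (_, ours) in results.items()
    if ours > classical_bound[bench]
]

for (model, bench), gain in gains.items():
    print(f"{model:10s} {bench:18s} +{gain} dB")
print("above classical bound:", above_bound)
```

With these figures, WSJ0 2Mix is the only listed benchmark where the reported score (23.9) is above the classical bound (23.1).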

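Applying a diffusion-based vocoder to the output of a deterministic separator yields state-of-the-art SI-SDR improvement for 2, 3, 5, 10, and 20 speakers.
For two speakers, the method exceeds the existing upper bound for deterministic models.
The learned spectral-domain fusion (via F) outperforms heuristic phase-alignment methods and simple averaging.
Across the WSJ0 and LibriSpeech benchmarks, gains of up to about 3 dB in SDR-related metrics are achieved in the reported settings.
The proposed bound suggests that, under reasonable assumptions, an improvement of up to roughly 3 dB beyond the deterministic upper bound is attainable.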
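To make the roughly 3 dB figure concrete, the relation below converts a dB difference into a ratio. It relies only on the standard decibel definition; the symbols SDR_gen and SDR_det are shorthand introduced here for the generative and classical bounds, not notation from the paper. The second line checks the gap using the WSJ0 2Mix column of the table.

```latex
\mathrm{SDR}_{\text{gen}} - \mathrm{SDR}_{\text{det}} = 3\ \text{dB}
\quad\Longrightarrow\quad
\frac{10^{\mathrm{SDR}_{\text{gen}}/10}}{10^{\mathrm{SDR}_{\text{det}}/10}} = 10^{3/10} \approx 1.995

26.1\ \text{dB} - 23.1\ \text{dB} = 3.0\ \text{dB} \qquad \text{(WSJ0 2Mix)}
```

In other words, a 3 dB higher bound corresponds to about twice the signal-to-distortion power ratio. The same 3.0 dB gap between the two bound rows appears in every column of the table.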


This review was generated by AI and checked by a human editor.