QUICK REVIEW

[Paper Review] BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis

Yichong Leng, Zehua Chen | arXiv (Cornell University) | May 30, 2022

Speech and Audio Processing | Citations: 29

One-Sentence Summary

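BinauralGrad introduces a two-stage diffusion-based framework that synthesizes high-fidelity binaural audio from a mono input: the first stage generates the information common to both ears, and the second stage adds the left- and right-specific differences.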

ABSTRACT

Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing them from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however, are difficult to accurately simulate in traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels as well as a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experiment results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.

Motivation and Objectives

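Immersive AR/VR experiences require accurate binaural audio, and recording it is expensive, which motivates synthesizing it from a mono input.
Decompose the binaural signal into a common component shared by both ears and a specific component for each ear.
Model the common and the specific information, respectively, with diffusion probabilistic models conditioned on the mono audio.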

Proposed Method

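Represents the binaural audio as y = (y^l, y^r) and decomposes it into a common component ȳ shared by both channels and per-ear differences δ^l and δ^r.
Stage 1: trains a single-channel diffusion model conditioned on the mono audio; it generates the common component ȳ from the mono input and its associated conditioning inputs.
Stage 2: trains a two-channel diffusion model conditioned on the Stage 1 output; it generates the left and right channels and thereby models their differences.
Uses the standard diffusion formulation, with a forward process q and a reverse process p, and a regression objective L_D(θ) that recovers the noise added at each step.
The architecture uses bidirectional dilated convolutions with a Conditioner module that fuses position information with the conditioning audio.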
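
The decomposition and the noise-regression objective listed above can be illustrated with a minimal NumPy sketch. Several things here are assumptions made for illustration rather than details taken from this review: the common component is taken to be the channel average, the linear noise schedule and its values are arbitrary, `toy_denoiser` is a hypothetical stand-in for the real network, and the position conditioning is omitted entirely.

```python
import numpy as np

def decompose(y_left, y_right):
    """Split binaural audio into a common part and per-ear differences.
    Assumption: the common part is the average of the two channels."""
    y_bar = 0.5 * (y_left + y_right)
    return y_bar, y_left - y_bar, y_right - y_bar

def make_schedule(num_steps=200, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (illustrative values, not the paper's)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward process q(x_t | x_0): mix clean audio with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def noise_regression_loss(eps_pred, eps):
    """Regression objective: mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def toy_denoiser(x_t, t, condition):
    """Hypothetical placeholder for the denoising network; always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
mono = rng.standard_normal(1600)
# Synthetic "binaural" pair built from the mono signal, for demonstration only.
y_left = 0.9 * mono + 0.01 * rng.standard_normal(1600)
y_right = 0.7 * mono + 0.01 * rng.standard_normal(1600)

y_bar, delta_l, delta_r = decompose(y_left, y_right)
betas, alpha_bars = make_schedule()
t = 50

# Stage 1: single-channel target y_bar, conditioned on the mono audio.
x_t_stage1, eps1 = q_sample(y_bar, t, alpha_bars, rng)
loss_stage1 = noise_regression_loss(toy_denoiser(x_t_stage1, t, mono), eps1)

# Stage 2: two-channel target (left, right), conditioned on the Stage 1 output.
# The ground-truth y_bar stands in for the Stage 1 output here.
y = np.stack([y_left, y_right])
x_t_stage2, eps2 = q_sample(y, t, alpha_bars, rng)
loss_stage2 = noise_regression_loss(toy_denoiser(x_t_stage2, t, y_bar), eps2)

print(loss_stage1, loss_stage2)
```

With the averaging assumption, the two differences always cancel (δ^l + δ^r = 0), so the second stage only has to learn how the channels deviate from the shared signal.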

Experimental Results

Research Questions

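RQ1: Does separating the common information from the ear-specific components improve binaural synthesis from mono audio within a two-stage diffusion framework?
RQ2: How does conditioning the common stage on the mono audio, and the specific stage on the Stage 1 output, affect objective metrics and human perceptual quality?
RQ3: What are the quantitative gains over the baselines in Wave L2, Amplitude L2, Phase L2, PESQ, and MRSTFT?
RQ4: Would end-to-end or joint training of the two stages yield further performance gains?
RQ5: What do the ablation experiments reveal about each stage's contribution to overall quality?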

Key Findings

Model	Wave L2 (×10^-3) ↓	Amplitude L2 ↓	Phase L2 ↓	PESQ ↑	MRSTFT ↓
DSP	1.543	0.097	1.596	1.610	2.750
WaveNet [24]	0.179	0.037	0.968	2.305	1.915
WarpNet [29]	0.157	0.038	0.838	2.360	1.774
BinauralGrad	0.128	0.030	0.837	2.759	1.278
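
To make the first three error columns of the table concrete, the following is a simplified NumPy sketch of a waveform L2 error and STFT-based amplitude and phase errors. It is one plausible reading of these metrics, not the benchmark's evaluation code: the frame length, hop size, Hann window, and the wrapping of phase differences are all assumptions, so absolute values will not match the table. PESQ and MRSTFT are not covered.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Minimal STFT of a 1-D signal: Hann-windowed frames followed by a real FFT.
    Frame length and hop size are illustrative assumptions."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def wave_l2(pred, ref):
    """Mean squared error between the two waveforms."""
    return float(np.mean((pred - ref) ** 2))

def amplitude_l2(pred, ref):
    """Mean squared error between STFT magnitudes."""
    return float(np.mean((np.abs(stft(pred)) - np.abs(stft(ref))) ** 2))

def phase_l2(pred, ref):
    """Mean squared STFT phase difference, wrapped to (-pi, pi] (an assumption)."""
    diff = np.angle(stft(pred)) - np.angle(stft(ref))
    diff = np.angle(np.exp(1j * diff))
    return float(np.mean(diff ** 2))

# Demo signal: a 440 Hz tone and a slightly noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(4800) / 48000.0  # sample rate chosen for the demo
ref = np.sin(2 * np.pi * 440 * t)
pred = ref + 0.01 * rng.standard_normal(ref.shape)

print(wave_l2(pred, ref), amplitude_l2(pred, ref), phase_l2(pred, ref))
```

For a two-channel signal, one option is to apply each function to the left and right channels separately and average the results.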

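BinauralGrad achieves Wave L2 = 0.128 (×10^-3) and PESQ = 2.759, outperforming the DSP, WaveNet, and WarpNet baselines on the benchmark.
Amplitude L2 and MRSTFT also improve over every baseline: 0.030 versus 0.037 for the best baseline (WaveNet), and 1.278 versus 1.774 for the best baseline (WarpNet), respectively.
The MOS results favor BinauralGrad at 3.80 (versus 3.61 for the baseline reported in the abstract); it surpasses all baselines in overall naturalness and in similarity to the ground-truth binaural audio.
The two-stage model outperforms a single-stage diffusion model on most metrics, confirming the benefit of separating common and specific information.
The ablation indicates that Stage 1 (common information) is the bottleneck and suggests that optimizing both stages end to end could bring further gains.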
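
The size of the gains over the strongest listed baseline can be checked directly from the table. The short script below computes the relative improvement of BinauralGrad over WarpNet for each metric, using only the values reported above; the helper name `relative_gain` is introduced here for illustration.

```python
# Values copied from the results table (WarpNet vs. BinauralGrad).
warpnet = {"Wave L2": 0.157, "Amplitude L2": 0.038, "Phase L2": 0.838,
           "PESQ": 2.360, "MRSTFT": 1.774}
binauralgrad = {"Wave L2": 0.128, "Amplitude L2": 0.030, "Phase L2": 0.837,
                "PESQ": 2.759, "MRSTFT": 1.278}
higher_is_better = {"PESQ"}  # every other metric is an error (lower is better)

def relative_gain(metric):
    """Relative improvement in percent; positive means BinauralGrad is better."""
    base, ours = warpnet[metric], binauralgrad[metric]
    if metric in higher_is_better:
        return 100.0 * (ours - base) / base
    return 100.0 * (base - ours) / base

gains = {metric: relative_gain(metric) for metric in warpnet}
for metric, gain in gains.items():
    print(f"{metric}: {gain:.2f}%")
```

The Phase L2 gain is marginal (0.837 versus 0.838), while the other four metrics improve by roughly 17% to 28% relative to WarpNet.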

This review was generated by AI and checked by a human editor.