QUICK REVIEW

[Paper Review] Phase-aware Speech Enhancement with Deep Complex U-Net

Hyeong-Seok Choi, Jang-Hyun Kim|arXiv (Cornell University)|Mar 7, 2019

Speech and Audio Processing | Cited by 79

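One-line summary

This paper proposes three components for phase-aware speech enhancement: Deep Complex U-Net, a polar coordinate-wise complex-valued masking method, and a weighted source-to-distortion ratio (wSDR) loss. Together they are reported to substantially improve reconstruction quality over magnitude-only methods.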

ABSTRACT

Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.

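Motivation and objectives

Improve speech enhancement by estimating the phase of clean speech instead of reusing the phase of the noisy input.
Develop Deep Complex U-Net, a U-Net built from complex-valued building blocks that operates on complex-valued spectrograms.
Propose a polar coordinate-wise complex-valued masking method that handles phase and magnitude jointly.
Introduce a weighted source-to-distortion ratio (wSDR) loss aligned with the evaluation measure.
Demonstrate empirical gains through ablation experiments on a standard mixed speech dataset.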
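
As a rough illustration of what a complex-valued building block involves, the sketch below computes a complex convolution from four real-valued convolutions, using the identity (A + iB)(x + iy) = (Ax - By) + i(Bx + Ay). This review does not specify the layer definitions, so this is only an assumed, minimal form: `complex_conv1d` is a hypothetical helper name, the example is 1-D rather than the 2-D layers a U-Net would use, and NumPy's true convolution stands in for the cross-correlation that deep learning layers usually compute (the real/imaginary bookkeeping is the same).

```python
import numpy as np

def complex_conv1d(x, w):
    """'Valid' 1-D convolution of a complex signal x with a complex kernel w,
    assembled from four real-valued convolutions (hypothetical helper)."""
    xr, xi = x.real, x.imag
    wr, wi = w.real, w.imag
    # Real part: Re(w) * Re(x) - Im(w) * Im(x)
    real = np.convolve(xr, wr, mode="valid") - np.convolve(xi, wi, mode="valid")
    # Imaginary part: Im(w) * Re(x) + Re(w) * Im(x)
    imag = np.convolve(xr, wi, mode="valid") + np.convolve(xi, wr, mode="valid")
    return real + 1j * imag

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
    w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
    # Should agree with NumPy's native complex convolution.
    print(np.allclose(complex_conv1d(x, w), np.convolve(x, w, mode="valid")))
```

Splitting the operation this way is what lets a framework with only real-valued convolution kernels process complex spectrograms.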

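Proposed method

Extend U-Net with complex-valued layers so that it operates on complex-valued spectrograms.
Introduce polar coordinate-wise complex-valued masking to model phase and magnitude simultaneously.
Define and use a weighted SDR (wSDR) loss designed to correlate directly with a quantitative evaluation measure.
Evaluate on the mixed Voice Bank + DEMAND dataset and conduct ablation studies.
Compare against previous methods that estimate only the magnitude and reuse the noisy phase.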
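
The polar coordinate-wise masking step can be sketched as follows. The review does not give the formula, so the specifics here are assumptions based on my reading of the paper's bounded variant: the mask magnitude is squashed with tanh and the mask phase is the raw network output normalised to unit modulus. The names `polar_mask` and `apply_mask` and the `eps` stabiliser are hypothetical.

```python
import numpy as np

def polar_mask(o, eps=1e-8):
    """Turn a raw complex network output `o` into a bounded complex mask.

    Assumed form: |M| = tanh(|O|), phase factor = O / |O|.
    """
    mag = np.abs(o)
    mask_mag = np.tanh(mag)        # magnitude bounded to [0, 1)
    mask_phase = o / (mag + eps)   # (near) unit-modulus phase factor
    return mask_mag * mask_phase

def apply_mask(noisy_spec, o):
    """Estimate the clean spectrogram by complex multiplication:
    magnitudes multiply and phases add."""
    return polar_mask(o) * noisy_spec

if __name__ == "__main__":
    o = np.array([3 + 4j, 0j, -1j])          # stand-in for network output
    noisy = np.array([1 + 1j, 2 + 0j, 1j])   # stand-in for noisy STFT bins
    print(apply_mask(noisy, o))
```

Because the mask is complex, it can rotate the phase of each time-frequency bin as well as scale its magnitude, which a real-valued magnitude mask cannot do.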

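Experimental results

Research questions

RQ1: Can a complex-valued U-Net improve phase-aware speech enhancement over magnitude-centric models?
RQ2: Does polar coordinate-wise complex masking capture the distribution of complex masks better than real-valued masking?
RQ3: Does the wSDR loss directly improve agreement with objective evaluation metrics?
RQ4: How much does each proposed component (complex U-Net, polar masking, wSDR) contribute to overall performance?
RQ5: How far does the proposed method outperform previous approaches on the standard mixed speech dataset?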

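Key findings

The method is reported to achieve state-of-the-art performance in all metrics on the mixed Voice Bank + DEMAND dataset.
Ablation experiments confirm that all three proposed approaches are empirically valid.
According to the abstract, the model outperforms previous approaches by a large margin.
Combining complex-valued modeling, polar masking, and the wSDR loss yields improved enhancement results.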
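
For readers who want a concrete picture of the wSDR loss, here is a minimal time-domain sketch. The review does not state the formula; the form below is my understanding of the paper's definition and should be treated as an assumption: a negative cosine similarity on the speech estimate plus the same term on the implied noise estimate, weighted by the energy ratio of clean speech to clean speech plus noise. The function names and the `eps` stabiliser are hypothetical.

```python
import numpy as np

def neg_cosine(a, b, eps=1e-8):
    """Negative cosine similarity between two 1-D signals, in [-1, 1]."""
    return -np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def wsdr_loss(x, y, y_hat, eps=1e-8):
    """Assumed wSDR form for noisy input x, clean target y, estimate y_hat."""
    z = x - y            # true noise
    z_hat = x - y_hat    # noise implied by the speech estimate
    energy_y = np.sum(y ** 2)
    energy_z = np.sum(z ** 2)
    alpha = energy_y / (energy_y + energy_z + eps)  # speech energy weight
    return alpha * neg_cosine(y, y_hat) + (1 - alpha) * neg_cosine(z, z_hat)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.standard_normal(1000)         # stand-in for clean speech
    x = y + 0.5 * rng.standard_normal(1000)  # noisy mixture
    print(wsdr_loss(x, y, y))   # perfect estimate, close to -1
    print(wsdr_loss(x, y, x))   # no enhancement, noticeably higher
```

Under this form the loss is bounded in [-1, 1] and is invariant to the scale of the estimate, and the noise term keeps it informative even when the clean target is nearly silent.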

This review was generated by AI and checked by a human editor.