QUICK REVIEW

[Paper Review] Phase-aware Speech Enhancement with Deep Complex U-Net

Hyeong-Seok Choi, Jang-Hyun Kim|arXiv (Cornell University)|Mar 7, 2019

Speech and Audio Processing | Cited by 79

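One-Sentence Summary

This paper proposes Deep Complex U-Net for phase-aware speech enhancement, together with a polar coordinate-wise complex-valued masking scheme and a wSDR loss, to improve reconstruction quality beyond magnitude-only methods.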

ABSTRACT

Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram while reusing the phase from noisy speech for reconstruction. This is due to the difficulty of estimating the phase of clean speech. To improve speech enhancement performance, we tackle the phase estimation problem in three ways. First, we propose Deep Complex U-Net, an advanced U-Net structured model incorporating well-defined complex-valued building blocks to deal with complex-valued spectrograms. Second, we propose a polar coordinate-wise complex-valued masking method to reflect the distribution of complex ideal ratio masks. Third, we define a novel loss function, weighted source-to-distortion ratio (wSDR) loss, which is designed to directly correlate with a quantitative evaluation measure. Our model was evaluated on a mixture of the Voice Bank corpus and DEMAND database, which has been widely used by many deep learning models for speech enhancement. Ablation experiments were conducted on the mixed dataset showing that all three proposed approaches are empirically valid. Experimental results show that the proposed method achieves state-of-the-art performance in all metrics, outperforming previous approaches by a large margin.

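Research Motivation and Objectives

Address the phase estimation problem instead of simply reusing the noisy phase, in order to improve speech enhancement.
Develop Deep Complex U-Net, built from complex-valued building blocks, to handle complex-valued spectrograms.
Propose a polar coordinate-wise complex-valued masking method that better reflects the distribution of complex ideal ratio masks.
Introduce a weighted source-to-distortion ratio (wSDR) loss designed to correlate directly with a quantitative evaluation measure.
Demonstrate empirical gains through ablation experiments on a standard noisy-speech dataset.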

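Proposed Method

Extend U-Net with complex-valued layers so that it operates on complex-valued spectrograms.
Introduce polar coordinate-wise complex-valued masking to model phase and magnitude jointly.
Define and train with a weighted SDR (wSDR) loss that correlates with a quantitative evaluation measure.
Evaluate on the Voice Bank + DEMAND mixture dataset and run ablation studies.
Compare against previous methods that estimate only the magnitude and reuse the noisy phase.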
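
The polar coordinate-wise masking idea can be illustrated for a single time-frequency bin. This review does not state the exact formula, so the sketch below rests on an assumption: the mask magnitude is bounded with tanh and the mask phase is the unit-normalized network output. The names `polar_mask`, `apply_mask`, and `eps` are hypothetical and chosen only for illustration.

```python
import cmath
import math


def polar_mask(o: complex, eps: float = 1e-8) -> complex:
    """Map an unbounded complex network output `o` to a bounded complex mask.

    Assumed form (not confirmed by this review): the magnitude is
    squashed with tanh and the phase is taken as o / |o|.
    """
    mag = abs(o)
    bounded_mag = math.tanh(mag)   # mask magnitude lies in [0, 1)
    phase = o / (mag + eps)        # roughly unit-modulus; eps guards against |o| = 0
    return bounded_mag * phase


def apply_mask(noisy_bin: complex, o: complex) -> complex:
    """Estimate a clean STFT bin by complex multiplication with the mask."""
    return polar_mask(o) * noisy_bin


# Usage: a purely imaginary output rotates the noisy bin by about 90 degrees
# while scaling its magnitude by tanh(10), which is just below 1.
estimate = apply_mask(1 + 0j, 10j)
rotation = cmath.phase(estimate)   # close to pi / 2
```

Because the mask is complex, it can correct the phase of the noisy bin as well as its magnitude, which a real-valued magnitude mask cannot do.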

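Experimental Results

Research Questions

RQ1: Can a complex-valued U-Net outperform magnitude-focused models in phase-aware speech enhancement?
RQ2: Does polar coordinate-wise complex-valued masking capture the distribution of complex masks better than real-valued masking?
RQ3: Does the wSDR loss directly improve agreement with objective evaluation measures?
RQ4: How much does each proposed component (complex U-Net, polar masking, wSDR) contribute to overall performance?
RQ5: How does the proposed method perform on a standard noisy-speech dataset compared with previous approaches?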

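Key Findings

The method achieves state-of-the-art performance on all metrics on the mixed Voice Bank and DEMAND dataset.
Ablation experiments confirm that all three proposed approaches are empirically valid.
According to the abstract, the model outperforms previous approaches by a large margin.
Combining complex-valued modeling, polar masking, and the wSDR loss yields improved enhancement results.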

This review was generated by AI and reviewed by a human editor.