QUICK REVIEW

[Paper Review] DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Yanxin Hu, Yun Liu|arXiv (Cornell University)|Aug 1, 2020

Speech and Audio Processing | References: 32 | Cited by: 59

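One-Sentence Summary

Introduces the Deep Complex Convolution Recurrent Network (DCCRN) for phase-aware single-channel speech enhancement; it uses complex-valued operations and stays small while scoring well on PESQ and MOS.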

ABSTRACT

Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or speech spectrum, via a naive convolution neural network (CNN) or recurrent neural network (RNN). Some recent studies use complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or real and imaginary part, respectively. Particularly, convolution recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper, we design a new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operation. The proposed DCCRN models are very competitive over other previous networks, either on objective or subjective metric. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first for the real-time-track and second for the non-real-time track in terms of Mean Opinion Score (MOS).
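
For readers unfamiliar with the term, the "complex-valued operation" that such a network simulates is commonly written as below. This is a sketch of the standard complex convolution rule, not a formula quoted from this summary. The symbols are assumptions introduced for illustration: $X = X_r + jX_i$ is a complex input feature map, $W = W_r + jW_i$ is a complex kernel, and $*$ is an ordinary real-valued convolution.

```latex
\begin{aligned}
X \circledast W
  &= (X_r + jX_i) * (W_r + jW_i) \\
  &= (X_r * W_r - X_i * W_i) + j\,(X_r * W_i + X_i * W_r)
\end{aligned}
```

Under this rule one complex convolution costs four real convolutions, and the real and imaginary parts stay coupled through the shared kernels $W_r$ and $W_i$ rather than being processed as two independent channels.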

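Motivation and Goals

Improve speech enhancement by modeling magnitude and phase jointly with a complex-valued CRN.
Reduce model size and computational complexity while maintaining or improving perceptual quality.
Demonstrate the advantage of phase-aware targets on the real-time and non-real-time tracks of the DNS Challenge.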

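Proposed Method

Design a Deep Complex Convolution Recurrent Network with a complex encoder/decoder and a complex LSTM.
Use complex convolution, complex batch normalization, and a complex LSTM to simulate complex-valued operations.
Train with a signal-approximation loss in which the network estimates a complex ratio mask (CRM) or a magnitude mask, optimizing SI-SNR in the time domain.
Compare four DCCRN variants (R, C, E, CL) against the CRN and DCUNET baselines on WSJ0-simulated data and DNS Challenge data.
Use SI-SNR as the training loss, with STFT/iSTFT handling analysis and waveform synthesis during training.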
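
The masking and loss steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `apply_polar_mask`, `si_snr`, and `si_snr_loss` are hypothetical. The `tanh` bound on the mask magnitude follows the polar-form masking usually associated with the E variant, and the zero-mean step in `si_snr` is a common convention assumed here.

```python
import cmath
import math


def apply_polar_mask(noisy_bin, mask_bin):
    """Apply a complex mask to one noisy time-frequency bin in polar form.

    Assumption: the magnitude gain is tanh(|mask|), so it stays in [0, 1),
    and the mask phase is added to the noisy phase.
    """
    noisy_mag, noisy_phase = cmath.polar(noisy_bin)
    mask_mag, mask_phase = cmath.polar(mask_bin)
    return cmath.rect(noisy_mag * math.tanh(mask_mag), noisy_phase + mask_phase)


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    n = len(target)
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    # Assumption: remove the DC offset from both signals first.
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target to get the "clean" component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt)
    scale = dot / (tgt_energy + eps)
    s_target = [scale * t for t in tgt]

    # Whatever is left after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_loss(estimate, target):
    """Training minimizes the negative SI-SNR of the resynthesized waveform."""
    return -si_snr(estimate, target)
```

In a full pipeline the masked bins would be passed through an iSTFT before the loss is computed, so the gradient flows from the time-domain SI-SNR back through the mask. Because of the projection step, rescaling the estimate leaves the score unchanged, which is what "scale-invariant" refers to.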

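Experimental Results

Research Questions

RQ1: Does a fully complex-valued CRN outperform real-valued or magnitude-only targets for phase-aware speech enhancement?
RQ2: How do the different DCCRN target representations (R, C, E, CL) affect objective (PESQ) and subjective (MOS) performance?
RQ3: What is the trade-off between model size, real-time capability, and enhancement quality on the WSJ0 and DNS Challenge datasets?

Key Findings

The DCCRN variants outperform the LSTM and CRN baselines in PESQ on the simulated WSJ0 dataset.
DCCRN-E achieves a strong DNS Challenge MOS on the real-time track and also performs well on the non-real-time track; DCCRN-CL gives a further PESQ gain but may over-suppress on some clips.
On WSJ0 and DNS data, the DCCRN models reach PESQ close to DCUNET with significantly fewer parameters and lower computation (DCUNET is the markedly heavier model relative to DCCRN-CL).
DCCRN-E-Aug (trained with additional reverberant data) yields an incremental MOS gain under reverberant conditions.
The final subjective evaluation shows an average MOS of about 3.42 for DCCRN-E (mixed no-reverb and reverb conditions), with a per-frame processing time of 3.12 ms on a desktop CPU/GPU.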


This review was generated by AI and checked by a human editor.