QUICK REVIEW

[Paper Review] Raw Waveform-based Speech Enhancement by Fully Convolutional Networks

Szu‐Wei Fu, Yu Tsao|arXiv (Cornell University)|Mar 7, 2017

Speech and Audio Processing|References 28|Cited by 35

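One-Sentence Summary

This paper proposes a fully convolutional network (FCN) for end-to-end speech enhancement on raw waveforms, bypassing spectral-domain processing to better preserve high-frequency components. The FCN outperforms the log power spectrum (LPS)-based DNN baseline on STOI and PESQ while using only about 0.2% of the parameters of the DNN and CNN models, and it restores the intelligibility and quality of noisy speech more effectively.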

ABSTRACT

This study proposes a fully convolutional network (FCN) model for raw waveform-based speech enhancement. The proposed system performs speech enhancement in an end-to-end (i.e., waveform-in and waveform-out) manner, which differs from most existing denoising methods that process the magnitude spectrum (e.g., log power spectrum (LPS)) only. Because the fully connected layers, which are involved in deep neural networks (DNN) and convolutional neural networks (CNN), may not accurately characterize the local information of speech signals, particularly with high frequency components, we employed fully convolutional layers to model the waveform. More specifically, FCN consists of only convolutional layers and thus the local temporal structures of speech signals can be efficiently and effectively preserved with relatively few weights. Experimental results show that DNN- and CNN-based models have limited capability to restore high frequency components of waveforms, thus leading to decreased intelligibility of enhanced speech. By contrast, the proposed FCN model can not only effectively recover the waveforms but also outperform the LPS-based DNN baseline in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). In addition, the number of model parameters in FCN is approximately only 0.2% compared with that in both DNN and CNN.

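Motivation and Objectives

Address the limitations of existing speech enhancement methods that rely on spectral representations such as the log power spectrum (LPS), which can distort high-frequency components.
Improve the intelligibility and quality of speech in noisy conditions by modeling the raw waveform directly.
Reduce model complexity and parameter count relative to the DNN and CNN baselines while maintaining or improving performance.
Examine how well a fully convolutional network (FCN) preserves the local temporal structure of speech signals.
Show that end-to-end waveform processing can outperform the conventional two-stage approach to speech enhancement.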
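To make the first point concrete, here is a minimal sketch of what an LPS feature keeps and what it drops. It is not taken from the paper: the naive DFT, the 64-sample frame, the tone frequency, and the helper names `dft` and `log_power_spectrum` are all assumptions chosen for illustration. It shows that two waveforms differing only in phase map to the same LPS value, so a model that processes only the magnitude spectrum cannot see that difference.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame (illustrative, O(n^2))."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]

def log_power_spectrum(frame, eps=1e-12):
    """Log power spectrum log(|X[k]|^2 + eps); the phase of X[k] is discarded."""
    return [math.log(abs(c) ** 2 + eps) for c in dft(frame)]

n = 64  # hypothetical frame length
# Two tones with the same frequency and amplitude that differ only in phase.
tone_a = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
tone_b = [math.sin(2 * math.pi * 4 * t / n + 1.0) for t in range(n)]

lps_a = log_power_spectrum(tone_a)
lps_b = log_power_spectrum(tone_b)

# The waveforms differ sample by sample...
max_wave_diff = max(abs(a - b) for a, b in zip(tone_a, tone_b))
# ...but the LPS at the tone's frequency bin is the same for both.
lps_diff_at_tone = abs(lps_a[4] - lps_b[4])

print(f"largest waveform difference: {max_wave_diff:.3f}")
print(f"LPS difference at the tone bin: {lps_diff_at_tone:.2e}")
```

A waveform-in, waveform-out model works on the samples themselves, so it does not depend on a separate step to supply the phase.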

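Proposed Method

The proposed model processes the raw speech waveform directly using only convolutional layers (no fully connected layers), enabling end-to-end learning from the input waveform to the enhanced output waveform.
The architecture uses dilated convolutions to enlarge the receptive field without adding parameters, which helps model long-range dependencies.
Training minimizes the mean squared error (MSE) between the enhanced waveform and the clean reference waveform.
Because the model is fully convolutional, it accepts variable-length input sequences and keeps the temporal resolution unchanged throughout the network.
Pooling layers are avoided in order to preserve fine temporal detail, especially in high-frequency components.
The model is trained end to end on raw waveform pairs (noisy and clean), with no intermediate spectral representation.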
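The points above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the layer count, filter width, channel count, leaky-ReLU activation, linear output layer, and random untrained weights are all hypothetical, and dilation is omitted for brevity. It shows the two structural properties described above: zero-padded ("same") convolutions keep the output as long as the input for any input length, and the training objective is the MSE between enhanced and clean samples.

```python
import random

def conv1d_same(signal, kernel, bias=0.0):
    """1-D cross-correlation with zero padding; output length equals input length.
    Assumes an odd kernel length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(signal))
    ]

def leaky_relu(v, slope=0.01):
    # Hypothetical activation choice for this sketch.
    return v if v > 0 else slope * v

def identity(v):
    return v

def conv_layer(channels, filters, activation):
    """channels: list of equal-length sequences.
    filters: for each output channel, one kernel per input channel."""
    out = []
    for kernels in filters:
        acc = [0.0] * len(channels[0])
        for channel, kernel in zip(channels, kernels):
            acc = [a + v for a, v in zip(acc, conv1d_same(channel, kernel))]
        out.append([activation(v) for v in acc])
    return out

def make_filters(n_out, n_in, width, rng):
    """Random (untrained) weights, for shape illustration only."""
    return [[[rng.uniform(-0.1, 0.1) for _ in range(width)]
             for _ in range(n_in)] for _ in range(n_out)]

def fcn_forward(noisy, layers):
    """Waveform in, waveform out: only convolutional layers, no pooling, no dense layers."""
    channels = [list(noisy)]
    for index, filters in enumerate(layers):
        is_last = index == len(layers) - 1
        channels = conv_layer(channels, filters, identity if is_last else leaky_relu)
    return channels[0]

def mse(enhanced, clean):
    """Mean squared error between two equal-length waveforms."""
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(enhanced)

rng = random.Random(0)
# Hypothetical sizes: two hidden layers of 4 filters, one output filter, width 5.
layers = [make_filters(4, 1, 5, rng), make_filters(4, 4, 5, rng), make_filters(1, 4, 5, rng)]

for length in (100, 257):  # the same weights handle different input lengths
    clean = [rng.gauss(0.0, 1.0) for _ in range(length)]
    noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
    enhanced = fcn_forward(noisy, layers)
    print(length, len(enhanced), round(mse(enhanced, clean), 4))
```

In a real system the weights would be learned by minimizing `mse` over many noisy and clean pairs; the sketch only demonstrates the shape-preserving forward pass and the loss.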

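Experimental Results

Research Questions

RQ1: Can a fully convolutional network (FCN) perform end-to-end speech enhancement on raw waveforms effectively, without any spectral transform?
RQ2: Does the FCN architecture preserve high-frequency components better than DNN and CNN models that contain fully connected layers?
RQ3: How far can the FCN reduce model complexity (parameter count) while matching or exceeding baseline performance?
RQ4: How does the FCN compare with LPS-based DNN and CNN baselines on objective metrics such as STOI and PESQ?
RQ5: Does removing fully connected layers lead to better generalization and more effective preservation of the local temporal structure of speech signals?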

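Key Findings

The FCN model outperforms the LPS-based DNN baseline on both short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), which supports the case for end-to-end waveform processing.
The FCN model effectively restores the high-frequency components of speech waveforms, whereas the DNN- and CNN-based models have limited ability to restore them, which lowers the intelligibility of their enhanced speech.
The FCN has approximately 0.2% of the parameters of the DNN and CNN models, a substantial reduction in model complexity.
The fully convolutional design preserves local temporal structure more effectively than models that include fully connected layers, particularly in the high-frequency range.
Taken together, the results indicate that the model is both efficient and effective: it achieves these gains with a very small parameter budget.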
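The parameter gap follows from simple counting, sketched below. The layer sizes here are hypothetical and are not the paper's configuration, so the resulting ratio illustrates the order of magnitude only and does not reproduce the reported 0.2%. A fully connected layer needs one weight per input-output pair, while a convolutional layer reuses one small kernel at every time step, so its parameter count does not depend on the frame length.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv1d_params(c_in, c_out, width):
    """Weights plus biases of a 1-D convolutional layer; independent of signal length."""
    return c_in * c_out * width + c_out

frame = 512  # hypothetical frame length in samples

# Hypothetical fully connected network: frame -> 1024 -> 1024 -> frame.
dense_total = (
    dense_params(frame, 1024)
    + dense_params(1024, 1024)
    + dense_params(1024, frame)
)

# Hypothetical fully convolutional network: 1 -> 16 -> 16 -> 1 channels, kernel width 11.
conv_total = (
    conv1d_params(1, 16, 11)
    + conv1d_params(16, 16, 11)
    + conv1d_params(16, 1, 11)
)

ratio = conv_total / dense_total
print(f"dense: {dense_total}, conv: {conv_total}, ratio: {ratio:.4%}")
```

Doubling `frame` roughly doubles or quadruples the dense layer counts but leaves `conv_total` unchanged, which is why convolution-only models stay small on waveform inputs.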

This review was generated by AI and reviewed by human editors.