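[Paper Explained] End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks
This paper proposes an end-to-end fully convolutional neural network (FCN) framework for speech enhancement that directly optimizes a perception-based evaluation metric such as STOI, reducing the gap between the training objective and the metric used for evaluation. By optimizing at the utterance level rather than with a frame-level loss, the model outperforms conventional MMSE-optimized models in speech intelligibility and automatic speech recognition (ASR) performance.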
Speech enhancement model is used to map a noisy speech to a clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in most studies, there is an inconsistency between the model optimization criterion and the evaluation criterion on the enhanced speech. For example, in measuring speech intelligibility, most of the evaluation metric is based on a short-time objective intelligibility (STOI) measure, while the frame based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model. Due to the inconsistency, there is no guarantee that the trained model can provide optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and evaluation criterion. Because of the utterance-based optimization, temporal correlation information of long speech segments, or even at the entire utterance level, can be considered when perception-based objective functions are used for the direct optimization. As an example, we implement the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that the STOI of test speech is better than conventional MMSE-optimized speech due to the consistency between the training and evaluation target. Moreover, by integrating the STOI in model optimization, the intelligibility of human subjects and automatic speech recognition (ASR) system on the enhanced speech is also substantially improved compared to those generated by the MMSE criterion.
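As a rough illustration of the inconsistency the abstract describes, the two training criteria can be written side by side. The notation here is assumed for this explainer and is not taken from the source: $\mathbf{x}_f$ and $\hat{\mathbf{x}}_f$ stand for the clean and estimated signal in frame $f$, $w_u(t)$ and $\hat{w}_u(t)$ for the clean and enhanced waveform of utterance $u$, and $\mathrm{stoi}(\cdot,\cdot)$ for the STOI function.

```latex
% Frame-based MMSE criterion: squared error averaged over F frames
\[
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{F} \sum_{f=1}^{F} \bigl\lVert \hat{\mathbf{x}}_f - \mathbf{x}_f \bigr\rVert_2^2
\]

% Utterance-based STOI criterion: negated STOI averaged over U utterances
\[
\mathcal{L}_{\mathrm{STOI}} = -\frac{1}{U} \sum_{u=1}^{U} \mathrm{stoi}\bigl(w_u(t), \hat{w}_u(t)\bigr)
\]
```

Minimizing the first expression only controls the per-frame error, whereas minimizing the second maximizes the same quantity that is later reported at evaluation time.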
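Research Motivation and Goals
- Address the inconsistency in speech enhancement between the model optimization criterion (e.g., MMSE) and perception-based evaluation metrics (e.g., STOI).
- Improve speech intelligibility and automatic speech recognition (ASR) performance by aligning the training objective with the downstream evaluation metric.
- Develop an end-to-end framework that optimizes whole utterances rather than individual frames, so that long-term temporal dependencies are preserved.
- Show that directly optimizing STOI with an FCN substantially improves both objective scores and subjective listening results.
- Validate the effectiveness of utterance-level optimization for improving speech perception and the performance of recognition systems.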
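Proposed Method
- The framework uses a fully convolutional neural network (FCN) to process the raw waveform end to end, avoiding frame-by-frame processing.
- Instead of a frame-level MMSE loss, it optimizes the short-time objective intelligibility (STOI) measure directly at the utterance level.
- STOI is made differentiable through a differentiable approximation, which allows gradients to be backpropagated through the evaluation function.
- The network is trained with a loss that maximizes the STOI between the enhanced and clean speech, which lets it capture long-term temporal correlations.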
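- The architecture uses causal convolutions, which give the model autoregressive behavior and preserve the temporal order of the waveform.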
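- The approach jointly optimizes speech enhancement and intelligibility, aligning the training objective directly with perception-based evaluation.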
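The two ideas in the list above (a convolution-only model that accepts waveforms of any length, and a score computed over the whole utterance) can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: `conv1d_same`, `stoi_like_score`, and `utterance_loss` are hypothetical names, the 3-tap kernel in the usage note is arbitrary, and the score keeps only the segment-wise correlation step of STOI while omitting the one-third octave band decomposition, normalization, and clipping. The default segment length of 30 frames follows the standard STOI definition.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1-D convolution (cross-correlation form, as in deep
    learning layers) whose output has the same length as the input."""
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def correlation(x, y):
    """Sample correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den if den > 0 else 0.0


def stoi_like_score(clean_env, enhanced_env, seg_len=30):
    """Average correlation over every overlapping seg_len-frame segment of a
    single band envelope. Simplification: real STOI also applies band
    decomposition, normalization, and clipping before this step."""
    n = min(len(clean_env), len(enhanced_env))
    if n < seg_len:
        raise ValueError("utterance is shorter than one segment")
    scores = [
        correlation(clean_env[m - seg_len:m], enhanced_env[m - seg_len:m])
        for m in range(seg_len, n + 1)
    ]
    return sum(scores) / len(scores)


def utterance_loss(clean_env, enhanced_env, seg_len=30):
    """Negated score, so that minimizing the loss maximizes the score."""
    return -stoi_like_score(clean_env, enhanced_env, seg_len)
```

Because `conv1d_same` has no fully connected stage, calling it on a 40-sample and a 400-sample list returns outputs of 40 and 400 samples respectively, which is what makes scoring the entire utterance in one pass possible. In an actual training setup these operations would be written in an automatic-differentiation framework so that gradients can flow from the score back to the network weights.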
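Experimental Results
Research Questions
- RQ1: Does directly optimizing STOI during training improve speech enhancement performance over conventional MMSE-based training?
- RQ2: Does utterance-level optimization yield better intelligibility for human listeners and ASR systems than frame-level optimization?
- RQ3: To what extent does aligning the training objective with the evaluation metric narrow the gap between model performance and the needs of real applications?
- RQ4: How does the proposed FCN-based framework compare with a standard MMSE-optimized model in terms of STOI, intelligibility, and ASR accuracy?
- RQ5: Can a differentiable STOI serve effectively as the training objective in an end-to-end speech enhancement system?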
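Key Findings
- The proposed STOI-optimized model achieves significantly higher STOI scores on the test set than the MMSE-optimized baseline.
- Human listeners find speech enhanced by the STOI-optimized model more intelligible than speech from the MMSE-optimized model.
- When an ASR system transcribes speech enhanced by the STOI-optimized model, the word error rate (WER) drops significantly.
- Utterance-level optimization preserves long-term temporal correlations, producing more natural and more intelligible speech.
- Directly optimizing STOI through a differentiable approximation enables effective backpropagation and stable training.
- The results indicate that aligning the training objective with a perception-based metric yields measurable gains in both objective and subjective performance.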
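For readers unfamiliar with the ASR metric mentioned in the findings, word error rate is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The function below is a minimal sketch of that definition; `word_error_rate` is a hypothetical helper name, and whitespace tokenization is an assumption rather than a detail from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A lower value means the recognizer made fewer errors, so enhanced speech that is easier for the ASR system to transcribe shows up as a drop in this number.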
This explainer was generated by AI and reviewed by a human editor.