[Paper Walkthrough] Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition
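This paper proposes a rule-based switching mechanism for overlapping speech recognition that chooses between the observed speech mixture and the enhanced speech according to the estimated signal-to-interference ratio (SIR) and signal-to-noise ratio (SNR). By avoiding the distortion that enhancement introduces, the method achieves a relative CER reduction of up to 27%, with the largest gains under high-SIR, low-SNR conditions.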
Although recent advances in deep learning technology improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate 'processing artifacts' as a side effect of the enhancement, which degrades ASR performance. For example, it is well known that single-channel noise reduction for non-speech noise (non-overlapping speech) often does not improve ASR. Likewise, the processing artifacts may also be detrimental to ASR in some conditions when processing overlapping speech with a separation/extraction method, although it is usually believed that separation/extraction improves ASR. In order to answer the question 'Do we always have to separate/extract speech from mixtures?', we analyze ASR performance on observed and enhanced speech at various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech. Based on these findings, we propose a simple switching algorithm between observed and enhanced speech based on the estimated signal-to-interference ratio and signal-to-noise ratio. We demonstrated experimentally that such a simple switching mechanism can improve recognition performance when processing artifacts are detrimental to ASR.
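Motivation and Goals
- Investigate whether speech enhancement always improves ASR performance on overlapping speech.
- Identify the conditions under which processing artifacts make the observed mixture a better ASR input than the enhanced speech.
- Develop a simple rule-based switching mechanism that selects the better input (observed mixture or enhanced signal) from SIR and SNR estimates.
- Show that switching can improve ASR performance without modifying the ASR model and without joint training.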
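Proposed Method
- ASR performance is evaluated separately on the enhanced speech produced by a single-channel target speech extraction model and on the observed mixture.
- Estimated SIR and SNR are used to decide whether to switch to the observed mixture or keep the enhanced signal.
- The switching rule selects the observed mixture when SIR − SNR ≥ 10 dB; this rule was chosen from an empirical analysis of development-set performance.
- The system uses a standard ASR pipeline based on the ESPnet recipe for the CSJ dataset, with speed perturbation and SpecAugment for data augmentation.
- Switching is applied at the input level, before ASR inference, and does not modify the ASR model or its training procedure.
- The method is evaluated on a fully overlapped dataset covering several noise types (cafe, pedestrian, street, bus) and a range of SNR/SIR combinations.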
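The switching rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `use_observed` and `select_asr_input` are hypothetical, the SIR and SNR estimates are assumed to come from a separate estimator that is not shown, and only the 10 dB margin is taken from the summary.

```python
from typing import Sequence, Tuple

# Margin from the summary: pick the observed mixture when SIR - SNR >= 10 dB.
SWITCH_MARGIN_DB = 10.0


def use_observed(est_sir_db: float, est_snr_db: float,
                 margin_db: float = SWITCH_MARGIN_DB) -> bool:
    """Return True when the observed mixture should be sent to ASR.

    Both arguments are *estimated* values in dB; how they are estimated
    is outside the scope of this sketch (assumption).
    """
    return (est_sir_db - est_snr_db) >= margin_db


def select_asr_input(observed: Sequence[float],
                     enhanced: Sequence[float],
                     est_sir_db: float,
                     est_snr_db: float) -> Tuple[str, Sequence[float]]:
    """Pick the waveform to feed to an unchanged ASR model (hypothetical helper)."""
    if use_observed(est_sir_db, est_snr_db):
        return "observed", observed
    return "enhanced", enhanced


if __name__ == "__main__":
    mixture = [0.1, -0.2, 0.3]      # placeholder samples, not real audio
    extracted = [0.05, -0.1, 0.2]   # placeholder output of an extraction model
    label, _ = select_asr_input(mixture, extracted, est_sir_db=20.0, est_snr_db=0.0)
    print(label)
```

Because the decision happens before ASR inference, a rule like this can sit in front of any recognizer without retraining it, which matches the input-level design described in the list above.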
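Experimental Results
Research Questions
- RQ1: Under which SIR and SNR conditions does speech enhancement degrade ASR performance even though it reduces the interference?
- RQ2: Compared with always using the enhanced speech, can switching between observed and enhanced speech improve ASR performance?
- RQ3: Is a simple rule-based switching mechanism driven by estimated SIR and SNR effective at improving ASR on overlapping speech?
- RQ4: Does the proposed switching strategy outperform using only the observed mixture or only the enhanced speech across noise and interference conditions?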
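Key Findings
- Under high SIR (20 dB) and low SNR (0 dB), speech enhancement degrades ASR performance, with CER rising by up to 57% relative to the observed mixture.
- At SIR 10 dB and SNR 0 dB, where the interfering speech is at a moderate level, speech extraction still fails to improve ASR performance.
- Under high SIR − SNR conditions, the proposed switching method achieves a 27% relative CER reduction compared with the enhanced speech; the largest improvement occurs at SIR 15 dB and SNR 10 dB.
- Under SIR − SNR ≥ 10 dB conditions, switching reduces CER by 22% on average, indicating consistent gains across noise types.
- When the switching decision is suboptimal, the method increases CER by at most 1%, indicating robustness to SIR and SNR estimation errors.
- The results indicate that when the distortion introduced by nonlinear processing outweighs the benefit of interference suppression, ASR on the observed mixture can outperform ASR behind an enhancement front-end.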
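The percentages in the findings above are relative changes in CER rather than absolute differences. As a rough sketch of that arithmetic, the helper below computes a relative change; the name `relative_cer_change` is hypothetical, and the CER values in the usage lines are illustrative placeholders, not numbers reported in the paper.

```python
def relative_cer_change(baseline_cer: float, new_cer: float) -> float:
    """Relative CER change of new_cer with respect to baseline_cer.

    Negative values mean a reduction (improvement); positive values mean
    an increase (degradation). Both inputs use the same unit, e.g. percent.
    """
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (new_cer - baseline_cer) / baseline_cer


if __name__ == "__main__":
    # Illustrative placeholders only: a baseline CER of 10.0% and a
    # switched-input CER of 7.3% correspond to a 27% relative reduction.
    change = relative_cer_change(baseline_cer=10.0, new_cer=7.3)
    print(f"{change:+.0%}")
```

Read this way, a "27% relative CER reduction" compares the switched system against always using the enhanced speech, while the "57% increase" compares the enhanced speech against the observed mixture.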
This walkthrough was generated by AI and reviewed by a human editor.