QUICK REVIEW

[Paper Review] Scaling sparsemax based channel selection for speech recognition with ad-hoc microphone arrays

Junqi Chen, Xiao-Lei Zhang | arXiv (Cornell University) | Mar 28, 2021

Speech and Audio Processing | References: 20 | Cited by: 7

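One-Sentence Summary

This paper proposes Scaling Sparsemax, a new channel selection method for multichannel end-to-end speech recognition with large-scale ad-hoc microphone arrays. By replacing the Softmax operator in the stream attention mechanism with Scaling Sparsemax, the model suppresses only the most severely corrupted channels while retaining the useful ones. It achieves a relative WER reduction of more than 30% over Softmax on simulated data and even outperforms the oracle one-best baseline on semi-real data.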

ABSTRACT

Recently, speech recognition with ad-hoc microphone arrays has received much attention. It is known that channel selection is an important problem for ad-hoc microphone arrays; however, this topic seems far from explored in speech recognition, particularly with large-scale ad-hoc microphone arrays. To address this problem, we propose a Scaling Sparsemax algorithm for channel selection in speech recognition with large-scale ad-hoc microphone arrays. Specifically, we first replace the conventional Softmax operator in the stream attention mechanism of a multichannel end-to-end speech recognition system with Sparsemax, which conducts channel selection by forcing the channel weights of noisy channels to zero. Because Sparsemax harshly punishes the weights of many channels to zero, we propose Scaling Sparsemax, which punishes the channels mildly by setting only the weights of very noisy channels to zero. Experimental results with ad-hoc microphone arrays of over 30 channels under the conformer speech recognition architecture show that the proposed Scaling Sparsemax yields a word error rate that is over 30% lower than that of Softmax on simulation data sets, and over 20% lower on semi-real data sets, in test scenarios with both matched and mismatched channel numbers.

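Research Motivation and Goals

Address the largely unexplored challenge of channel selection for far-field speech recognition with large-scale ad-hoc microphone arrays.
Improve automatic speech recognition (ASR) performance by optimizing a recognition-level objective directly, rather than relying on signal-quality proxies such as SNR.
Develop a scalable, differentiable channel selection mechanism that can handle arrays of more than 30 microphones.
Adopt a two-stage training strategy: first train a single-channel conformer on clean data, then fine-tune on noisy multichannel data so that the stream attention learns channel selection.
Outperform conventional Softmax and existing channel selection baselines, including the oracle one-best method, in both simulated and semi-real environments.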

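Proposed Method

Replace the Softmax operator in the stream attention mechanism with Sparsemax, which performs channel selection by forcing the weights of noisy channels to zero.
Propose Scaling Sparsemax, a differentiable and milder channel pruning method that sets only the weights of the most severely corrupted channels to zero, avoiding over-penalization.
Design a two-stage training strategy: pre-train a single-channel conformer on clean Librispeech data, then fine-tune the stream attention module on noisy multichannel data from ad-hoc arrays.
Adopt a conformer-based ASR architecture that uses multi-head attention in both the encoder and the decoder, and integrate a stream attention module to re-weight and fuse features from multiple channels.
Train the stream attention module on noisy multichannel data to learn the channel weights, with a guide vector generated from the decoder hidden state serving as the query input.
Apply SpecAugment for data augmentation, and use greedy decoding without a language model at inference time.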
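The difference between the three operators can be illustrated with a minimal sketch in plain Python. Softmax and Sparsemax follow their standard definitions. This summary does not state the exact Scaling Sparsemax formula, so the version below only captures the idea of shrinking the scores by a factor before applying Sparsemax; the fixed `scale` argument, the function names, the channel scores, and the toy features are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Standard softmax: every channel keeps a strictly positive weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Standard sparsemax: Euclidean projection onto the probability simplex."""
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(sorted(scores, reverse=True), start=1):
        cumulative += value
        if 1 + k * value > cumulative:
            tau = (cumulative - 1) / k  # threshold for a support of size k
        else:
            break
    return [max(s - tau, 0.0) for s in scores]


def scaled_sparsemax(scores, scale):
    """Assumed simplification: divide the scores by `scale` >= 1, then apply sparsemax.

    The fixed `scale` is a hypothetical stand-in for the scaling factor
    used by the paper, which this summary does not specify.
    """
    return sparsemax([s / scale for s in scores])


def fuse_channels(scores, features, operator):
    """Weight per-channel feature vectors by operator(scores) and sum them."""
    weights = operator(scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, fused


# Hypothetical attention scores for four channels; the last one is very noisy.
scores = [2.0, 1.5, 1.0, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

soft_w, _ = fuse_channels(scores, features, softmax)
sparse_w, _ = fuse_channels(scores, features, sparsemax)
scaled_w, scaled_fused = fuse_channels(
    scores, features, lambda z: scaled_sparsemax(z, scale=2.0)
)
```

With these toy scores, Softmax keeps all four channels, Sparsemax keeps only the two best (weights 0.75 and 0.25), and the scaled variant with `scale=2.0` drops only the noisiest channel (weights of roughly 0.58, 0.33, 0.08, and 0). A larger scale flattens the scores, so fewer channels fall below the Sparsemax threshold.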

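Experimental Results

Research Questions

RQ1: Can a differentiable, attention-based channel selection mechanism improve ASR performance with large-scale ad-hoc microphone arrays?
RQ2: Does replacing Softmax with Sparsemax or Scaling Sparsemax in the stream attention yield a lower WER than conventional Softmax or oracle one-best selection?
RQ3: How does the proposed method perform when the channel numbers are mismatched (for example, training with 16 channels and testing with 30)?
RQ4: Can Scaling Sparsemax outperform the oracle one-best baseline, which assumes that the microphone closest to the speaker is known?
RQ5: Does the two-stage training strategy (pre-training on clean data, then fine-tuning on noisy multichannel data) improve model convergence and performance?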

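Key Findings

On the simulated Libri-adhoc-simu dataset, Scaling Sparsemax achieved a 33.90% relative WER reduction over Softmax on the 'test-clean' set with 30-channel test data.
On the semi-real Libri-adhoc40 dataset, Scaling Sparsemax achieved a 17.4% relative WER reduction over the oracle one-best baseline in the 20-channel test scenario.
In the mismatched 30-channel test scenario on Libri-adhoc40, Scaling Sparsemax also achieved a 14.2% relative WER reduction over the oracle baseline.
In the 30-channel simulated tests, the model achieved a relative WER reduction of more than 30% over Softmax, demonstrating the effectiveness of channel selection with large-scale arrays.
Visualizations show that Softmax only re-weights the channels and Sparsemax over-penalizes many of them, whereas Scaling Sparsemax selectively suppresses only the most severely corrupted channels and achieves the best performance.
The two-stage training strategy avoided training failure when extremely noisy channels were introduced and improved generalization in scenarios with mismatched channel numbers.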
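The percentages above are relative WER reductions, meaning the drop in WER is expressed as a fraction of the baseline WER rather than as an absolute difference in points. A minimal sketch of that arithmetic follows; the function name and the example WER values are hypothetical and are not taken from the paper's tables.

```python
def relative_wer_reduction(baseline_wer, proposed_wer):
    """Return the relative WER reduction in percent.

    Both arguments are WERs in percent; the baseline must be positive.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# Hypothetical numbers: a baseline at 20.0% WER and a proposed system at 15.0% WER.
example = relative_wer_reduction(20.0, 15.0)  # 25.0, i.e. a 25% relative reduction
```

In this made-up case the absolute improvement is 5 points while the relative reduction is 25%, which is why relative figures can look large even when the absolute WER gap is modest.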


This review was generated by AI and checked by human editors.