QUICK REVIEW

[Paper Review] ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Wei Han, Zhengdong Zhang|arXiv (Cornell University)|May 7, 2020

Speech Recognition and Synthesis | References: 36 | Cited by: 72

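One-Sentence Summary

ContextNet injects global context into a CNN-based audio encoder through squeeze-and-excitation modules and pairs that encoder with an RNN-T decoder. It reaches state-of-the-art or near state-of-the-art WER on LibriSpeech with fewer parameters, and it scales well with model width. The paper also shows that downsampling gives an effective speed/accuracy trade-off.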

ABSTRACT

Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet that achieves good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system of 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.

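Motivation and Objectives

Narrow the gap between CNN-based ASR and RNN/Transformer models by introducing global context.
Propose ContextNet, an architecture that adds squeeze-and-excitation modules to the CNN encoder.
Study model scaling and progressive downsampling to balance accuracy against efficiency.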

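Proposed Method

Use a fully convolutional audio encoder built from depthwise separable convolutions with Swish activations.
Add a one-dimensional squeeze-and-excitation module to each convolution block to inject global context.
Pair the encoder with an RNN-T decoder, giving an end-to-end CNN-RNN-T architecture.
Apply progressive temporal downsampling, 8x in total, to reduce computation.
Scale the model width with a parameter alpha to trade FLOPs against accuracy.
Train and evaluate on LibriSpeech, using SpecAugment and shallow fusion with Transformer/LSTM language models.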
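
To make the squeeze-and-excitation step concrete, here is a minimal sketch in plain Python. It assumes the usual formulation: average each channel over time, pass the result through a two-layer bottleneck, and use a sigmoid gate to rescale every frame of that channel. The function names, the weight shapes, and the toy weights are illustrative assumptions and are not taken from the paper or any library.

```python
import math

def swish(v):
    """Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(vec, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Apply 1D squeeze-and-excitation to x.

    x is a list of channels, each channel a list of T time steps.
    Returns the rescaled features and the per-channel gates.
    """
    # Squeeze: one average per channel over the whole utterance (global context).
    context = [sum(channel) / len(channel) for channel in x]
    # Excitation: bottleneck of two dense layers, then a sigmoid gate per channel.
    hidden = [swish(h) for h in dense(context, w1, b1)]
    gates = [sigmoid(g) for g in dense(hidden, w2, b2)]
    # Scale: every time step of a channel is multiplied by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, x)]
    return scaled, gates

# Toy input: 2 channels, 3 frames, bottleneck of size 1 (hypothetical weights).
features = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
scaled, gates = squeeze_excite_1d(
    features, w1=[[0.5, 0.5]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0]
)
print(gates)
```

Because each gate is computed from an average over all frames, every output frame depends on the entire utterance, which a convolution with a small kernel cannot achieve on its own.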

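Experimental Results

Research Questions

RQ1: Does adding global context to the CNN encoder through squeeze-and-excitation reduce WER on LibriSpeech relative to earlier CNN models?
RQ2: How does progressive downsampling affect ContextNet's computation and accuracy?
RQ3: How does ContextNet behave as its width (alpha) is scaled, compared with Transformer/LSTM baselines and earlier CNN models on LibriSpeech?
RQ4: How robust is ContextNet when evaluated without an external language model and on the noisier test set?
RQ5: Does the approach generalize to larger datasets beyond LibriSpeech?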

Key Findings

Method	#Params (M)	test-clean WER % (no LM)	test-other WER % (no LM)	test-clean WER % (with LM)	test-other WER % (with LM)
QuartzNet (CNN)	19	3.90	11.28	2.69	7.25
ContextNet(S)	10.8	2.9	7.0	2.3	5.5
ContextNet(M)	31.4	2.4	5.4	2.0	4.5
ContextNet(L)	112.7	2.1	4.6	1.9	4.1
Transformer [9]	-	2.6	5.7	-	-
Transformer [33]	270	2.89	6.98	2.33	5.17
LSTM	360	2.6	6.0	2.2	5.2
TDNN [35]	192	-	-	-	-
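
The numbers in the table are word error rates. As a reminder of how that metric works, the sketch below computes WER for a single sentence pair as the word-level edit distance divided by the number of reference words. The function name and the example sentences are illustrative assumptions. Benchmark scores are normally computed over a whole test set with a standard scoring tool, so this only illustrates the definition.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```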

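ContextNet(L) reaches 2.1%/4.6% WER on LibriSpeech test-clean/test-other without an LM and 1.9%/4.1% with an LM, matching the figures quoted in the abstract.
ContextNet(M) reaches 2.4%/5.4% on test-clean/test-other without an LM and 2.0%/4.5% with an LM (figures from the table).
ContextNet(S), at roughly 10M parameters, reaches 2.9%/7.0% on test-clean/test-other without an LM and 2.3%/5.5% with an LM (figures from the table).
ContextNet outperforms earlier CNN models such as QuartzNet on LibriSpeech, and it beats several Transformer/LSTM baselines in both WER and parameter efficiency (Table 2).
Progressive 8x downsampling substantially reduces FLOPs, while its effect on accuracy is modest or even positive (Table 4).
Increasing the model width (alpha) lowers WER at the cost of a larger parameter budget (Table 5).
In large-scale experiments on YouTube-like data, ContextNet beats an earlier TDNN-based architecture on WER while using fewer parameters and fewer FLOPs (Table 6).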
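
The downsampling and width-scaling findings can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes that the 8x reduction comes from three stride-2 layers with "same" padding, and it counts only the weights of a single depthwise separable convolution. The base channel count of 256, the kernel size of 5, and all function names are hypothetical values chosen for illustration; the summary above states only the 8x total and the alpha parameter.

```python
import math

def frames_after_downsampling(num_frames, strides):
    """Sequence length after each strided layer, assuming 'same' padding."""
    lengths = [num_frames]
    for stride in strides:
        lengths.append(math.ceil(lengths[-1] / stride))
    return lengths

def scaled_channels(base_channels, alpha):
    """Scale every layer width by alpha, rounding to whole channels."""
    return [int(round(c * alpha)) for c in base_channels]

def separable_conv_params(c_in, c_out, kernel_size):
    """Weights in a depthwise conv (kernel_size * c_in) plus a pointwise conv
    (c_in * c_out); biases and normalization are ignored."""
    return kernel_size * c_in + c_in * c_out

# Three stride-2 stages shrink 1000 input frames to 125, an 8x reduction.
print(frames_after_downsampling(1000, [2, 2, 2]))

# Doubling the width roughly quadruples the weights of one separable conv,
# because the pointwise term grows with c_in * c_out.
for alpha in (0.5, 1.0, 2.0):
    (width,) = scaled_channels([256], alpha)
    print(alpha, width, separable_conv_params(width, width, kernel_size=5))
```

Every layer placed after a stride-2 stage processes half as many frames, which is why downsampling early in the encoder saves more computation than downsampling late.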

This review was generated by AI and reviewed by a human editor.