QUICK REVIEW

[Paper Review] A neural attention model for speech command recognition

Douglas Coimbra de Andrade, S. Leo|arXiv (Cornell University)|Aug 27, 2018

Speech Recognition and Synthesis|References: 19|Cited by: 128

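One-Sentence Summary

This paper proposes an attention-based convolutional bidirectional LSTM model for speech command recognition that achieves state-of-the-art accuracy on Google Speech Commands V1 and V2 with only 202K trainable parameters, and it provides attention visualizations for interpretability.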

ABSTRACT

This paper introduces a convolutional recurrent network with attention for speech command recognition. Attention models are powerful tools to improve performance on natural language, image captioning and speech tasks. The proposed model establishes a new state-of-the-art accuracy of 94.1% on Google Speech Commands dataset V1 and 94.5% on V2 (for the 20-commands recognition task), while still keeping a small footprint of only 202K trainable parameters. Results are compared with previous convolutional implementations on 5 different tasks (20 commands recognition (V1 and V2), 12 commands recognition (V1), 35 word recognition (V1) and left-right (V1)). We show detailed performance results and demonstrate that the proposed attention mechanism not only improves performance but also allows inspecting what regions of the audio were taken into consideration by the network when outputting a given category.

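Research Motivation and Goals

Enable lightweight, on-device speech command recognition on devices that lack a reliable internet connection.
Propose a novel attention-based recurrent architecture that improves accuracy on keyword spotting tasks.
Demonstrate state-of-the-art results across multiple tasks on the Google Speech Commands datasets V1 and V2.
Provide attention-weight visualizations that make the model's decisions interpretable.
Release source code to support reproducibility and further research.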

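Proposed Method

The input is a raw WAV file, loaded as a numpy array and converted by non-trainable Kapre layers into an 80-band mel-scale spectrogram.
A convolutional stage applied along the time dimension extracts local temporal features from the mel spectrogram.
Two stacked bidirectional LSTM layers capture forward and backward temporal dependencies.
An attention mechanism takes the middle LSTM output vector as the query and computes a weighted average of all LSTM outputs.
The weighted context vector passes through three fully connected layers with ReLU activations, followed by a softmax classification layer.
Training uses Adam with an initial learning rate of 0.001 that decays during training, early stopping based on validation performance, and a batch size of 64.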
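
The attention step in the method list above can be illustrated with a minimal, dependency-free Python sketch. It assumes plain dot-product scoring between the query and each output vector, which this summary does not specify. The names `softmax` and `attention_pool`, and the optional `project` argument (a stand-in for any learned query projection), are hypothetical and are not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(outputs, project=None):
    """Pool a sequence of output vectors into a single context vector.

    outputs: list of T vectors (lists of floats) standing in for the
             per-frame outputs of the last recurrent layer.
    project: optional callable applied to the middle vector to form the
             query (hypothetical placeholder for a learned projection).
    Returns (weights, context).
    """
    if not outputs:
        raise ValueError("outputs must contain at least one vector")

    # Use the middle output vector as the query, as described in the summary.
    query = outputs[len(outputs) // 2]
    if project is not None:
        query = project(query)

    # Assumption: dot-product similarity between the query and each frame.
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in outputs]
    weights = softmax(scores)

    # Context is the attention-weighted average of the output vectors.
    dim = len(outputs[0])
    context = [
        sum(w * vec[i] for w, vec in zip(weights, outputs)) for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values, not real LSTM outputs
    weights, context = attention_pool(frames)
    print("weights:", weights)
    print("context:", context)
```

In a sketch like this, the per-frame `weights` are the quantity one would plot against time to see which parts of the audio contributed most to a prediction, and `context` is what would be handed to the fully connected layers.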

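Experimental Results

Research Questions

RQ1: Can an attention-based RNN improve small-vocabulary speech command recognition accuracy over previous lightweight models?
RQ2: Does the attention mechanism offer interpretable insight into each command by showing which temporal regions of the audio are most informative?
RQ3: How much does a compact model improve performance across the Google Speech Commands V1 and V2 tasks (20 commands, 12 commands, 35 words, left-right)?
RQ4: How does the model compare with previous architectures in parameter count and accuracy?
RQ5: Can the model run on-device on resource-constrained hardware while maintaining high accuracy?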

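Key Findings

The attention RNN achieves state-of-the-art accuracy on the Google Speech Commands tasks: 20 commands, 94.1% (V1) and 94.5% (V2); 35 words, 94.3% (V1) and 93.9% (V2); left-right, 99.2% (V1) and 99.4% (V2).
The model is compact, with 202K trainable parameters.
On the 12-command task, the attention RNN reaches 95.6% (V1) and 96.9% (V2) with the same parameter budget.
Attention visualizations highlight vowel transitions and other relevant audio regions, which matches intuition and improves interpretability.
Compared with previous models, the attention RNN delivers a notable accuracy gain while keeping a small footprint.
Confusion matrices reveal challenging pairs (such as "three" vs. "tree" and "no" vs. "down") and suggest that contextual information could improve discrimination.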

This review was generated by AI and reviewed by a human editor.