QUICK REVIEW

[Paper Review] A neural attention model for speech command recognition

Douglas Coimbra de Andrade, S. Leo | arXiv (Cornell University) | Aug 27, 2018

Speech Recognition and Synthesis | 19 references | 128 citations

TL;DR

The paper presents a convolutional bidirectional LSTM model with an attention mechanism for speech command recognition, achieving state-of-the-art accuracy on Google Speech Commands V1 and V2 with a compact 202K parameters and offering attention visualizations for interpretability.

ABSTRACT

This paper introduces a convolutional recurrent network with attention for speech command recognition. Attention models are powerful tools to improve performance on natural language, image captioning and speech tasks. The proposed model establishes a new state-of-the-art accuracy of 94.1% on Google Speech Commands dataset V1 and 94.5% on V2 (for the 20-commands recognition task), while still keeping a small footprint of only 202K trainable parameters. Results are compared with previous convolutional implementations on 5 different tasks (20 commands recognition (V1 and V2), 12 commands recognition (V1), 35 word recognition (V1) and left-right (V1)). We show detailed performance results and demonstrate that the proposed attention mechanism not only improves performance but also allows inspecting what regions of the audio were taken into consideration by the network when outputting a given category.

Motivation & Objective

Motivate lightweight, locally run speech command recognition for devices without reliable internet access.
Propose a novel attention-based recurrent architecture to improve accuracy on keyword spotting (KWS) tasks.
Demonstrate state-of-the-art results on Google Speech Commands datasets V1 and V2 across multiple tasks.
Provide attention weight visualizations to make the model's decisions interpretable.
Make source code available to enable reproducibility and further research.

Proposed method

Inputs are raw WAV files converted to NumPy arrays and processed into 80-band mel-scale spectrograms by non-trainable Kapre layers.
A time-dimension convolutional stage extracts local temporal features from the mel-spectrograms.
Two stacked bidirectional LSTM layers capture forward and backward temporal dependencies.
An attention mechanism takes the LSTM output vector at the middle time step as the query and uses it to compute a weighted average of all LSTM outputs.
The weighted context is passed through three dense layers with ReLU activations, followed by a softmax classification layer.
Training uses the Adam optimizer with an initial learning rate of 0.001 and learning-rate decay, early stopping based on validation performance, and a batch size of 64.
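
The attention step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dot-product scoring, the query projection matrix `w_query`, the function name `attention_pool`, and the array shapes are all assumptions chosen for illustration. Only the use of the middle LSTM output as the query and the weighted average of LSTM outputs come from the review text.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(lstm_out, w_query):
    """Pool a sequence of LSTM outputs into one context vector.

    lstm_out: (T, D) array, one D-dimensional output per time step.
    w_query:  (D, D) projection applied to the query (hypothetical parameter).
    Returns the (D,) context vector and the (T,) attention weights.
    """
    num_steps = lstm_out.shape[0]
    # Query is built from the output at the middle time step.
    query = lstm_out[num_steps // 2] @ w_query
    # Dot-product score between the query and every time step (assumed scoring).
    scores = lstm_out @ query
    weights = softmax(scores)
    # Weighted average of the LSTM outputs.
    context = weights @ lstm_out
    return context, weights

# Illustrative shapes only, not values taken from the paper.
rng = np.random.default_rng(0)
lstm_out = rng.standard_normal((125, 128))
w_query = 0.1 * rng.standard_normal((128, 128))
context, weights = attention_pool(lstm_out, w_query)
```

The `weights` array is what an attention visualization would plot against time: one non-negative value per time step, summing to 1.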

Experimental results

Research questions

RQ1: Can an attention-based RNN improve accuracy for small-vocabulary speech command recognition compared to prior lightweight models?
RQ2: Does an attention mechanism provide interpretable insights into which temporal regions of audio are most informative for each command?
RQ3: What are the performance gains on Google Speech Commands datasets V1 and V2 across multiple tasks (20 commands, 12 commands, 35 words, left-right) with a compact model?
RQ4: How does the proposed model compare to previous architectures in terms of parameter count and accuracy?
RQ5: Is the model capable of running locally on resource-constrained devices while maintaining high accuracy?

Key findings

The Attention RNN achieves state-of-the-art accuracy on the Google Speech Commands tasks: 20-command, 94.1% (V1) and 94.5% (V2); 35-word, 94.3% (V1) and 93.9% (V2); left-right, 99.2% (V1) and 99.4% (V2).
The model is compact, with 202K trainable parameters.
On the 12-command task, the Attention RNN achieves 95.6% (V1) and 96.9% (V2) with the same parameter budget.
Attention visualizations align with intuition by highlighting vowel transitions and relevant audio regions, enabling model explainability.
Compared to prior models, the Attention RNN provides substantial accuracy gains while maintaining a small footprint.
Confusion matrices reveal challenging pairs (e.g., “three” vs “tree”, “no” vs “down”) and suggest contextual information would improve disambiguation.
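
As a rough illustration of how confused pairs like those above can be surfaced from model predictions, the sketch below counts the most frequent misclassified (true, predicted) label pairs. The helper name `top_confusions` and the toy labels are hypothetical and made up for this example; they are not results from the paper.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs where the labels differ."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Toy labels for illustration only (not data from the paper).
y_true = ["three", "three", "tree", "no", "no", "down", "left"]
y_pred = ["tree", "tree", "tree", "down", "no", "down", "left"]

for (true_label, predicted_label), count in top_confusions(y_true, y_pred):
    print(f"{true_label} -> {predicted_label}: {count}")
```

On real evaluation data, the same counts correspond to the largest off-diagonal entries of the confusion matrix.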

This review was created by AI and reviewed by human editors.