QUICK REVIEW

[Paper Review] Optimizing Speech Recognition For The Edge

Yuan Shangguan, Jian Li | arXiv (Cornell University) | Sep 26, 2019

Speech Recognition and Synthesis | 34 references | 58 citations

TL;DR

The paper presents end-to-end on-device speech recognition optimized via pruning, alternative RNN topologies (CIFG-LSTM and SRU), and quantization, achieving much smaller models with competitive WER.

ABSTRACT

While most deployed speech recognition systems today still run on servers, we are in the midst of a transition towards deployments on edge devices. This leap to the edge is powered by the progression from traditional speech recognition pipelines to end-to-end (E2E) neural architectures, and the parallel development of more efficient neural network topologies and optimization techniques. Thus, we are now able to create highly accurate speech recognizers that are both small and fast enough to execute on typical mobile devices. In this paper, we begin with a baseline RNN-Transducer architecture comprised of Long Short-Term Memory (LSTM) layers. We then experiment with a variety of more computationally efficient layer types, as well as apply optimization techniques like neural connection pruning and parameter quantization to construct a small, high quality, on-device speech recognizer that is an order of magnitude smaller than the baseline system without any optimizations.

Motivation & Objective

Motivate the shift of speech recognition from servers to edge devices while maintaining accuracy.
Explore three primary optimization axes—pruning, architectural variants, and quantization—to build compact, real-time on-device models.
Evaluate combinations of these techniques on a state-of-the-art RNN-T model across diverse datasets.

Proposed method

Develop an automated gradual pruning algorithm that increases weight sparsity during training while still letting pruned weights recover.
Compare LSTM, CIFG-LSTM, and SRU cell topologies within the RNN-T framework.
Apply two quantization schemes (hybrid 8-bit/float and integer quantization) for efficient on-device inference.
Use an 8x1 block-sparse structure to speed up CPU inference and support on-device execution.
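
The pruning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation. It assumes a cubic sparsity ramp in the style of Zhu and Gupta's gradual pruning schedule, scores each block by its summed absolute weight, and treats an 8x1 block as 8 consecutive rows within one column. All function names and step counts are hypothetical. The mask is recomputed from the dense weights at every pruning step, which is one way a previously pruned weight could return if it grows again during training.

```python
def sparsity_at_step(step, begin_step, end_step, initial_sparsity, final_sparsity):
    """Cubic ramp from initial_sparsity to final_sparsity (assumed schedule)."""
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def block_mask(weights, sparsity, block_rows=8):
    """Return a 0/1 mask that zeroes the lowest-magnitude (block_rows x 1) blocks.

    weights is a list of rows (list of lists of floats); the row count must be
    a multiple of block_rows. The dense weights themselves are left untouched.
    """
    rows, cols = len(weights), len(weights[0])
    if rows % block_rows != 0:
        raise ValueError("row count must be a multiple of block_rows")

    # Score every block by the sum of absolute values of its entries.
    scored_blocks = []
    for row_start in range(0, rows, block_rows):
        for col in range(cols):
            score = sum(abs(weights[r][col]) for r in range(row_start, row_start + block_rows))
            scored_blocks.append((score, row_start, col))
    scored_blocks.sort()

    # Zero out the weakest fraction of blocks.
    num_pruned = int(sparsity * len(scored_blocks))
    mask = [[1] * cols for _ in range(rows)]
    for _, row_start, col in scored_blocks[:num_pruned]:
        for r in range(row_start, row_start + block_rows):
            mask[r][col] = 0
    return mask


def apply_mask(weights, mask):
    """Masked copy of the weights, as would be used in the forward pass."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]
```

In a training loop, one would call `sparsity_at_step` every few hundred steps, rebuild the mask with `block_mask`, and use `apply_mask` for the forward pass while continuing to update the dense weights.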

Experimental results

Research questions

RQ1: Can aggressive pruning reduce model size substantially with minimal accuracy loss for edge-delivered speech recognition?
RQ2: Are CIFG-LSTM and SRU architectures viable substitutes for traditional LSTM in encoder/decoder roles within RNN-T?
RQ3: Do quantization methods preserve accuracy while delivering real-time performance on mobile CPUs?
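
As background for the second question, the size advantage of a CIFG-LSTM can be estimated with a rough per-layer parameter count. A standard LSTM layer has four weight blocks (input gate, forget gate, output gate, cell candidate). CIFG couples the input gate to the forget gate (input = 1 - forget), so it needs only three. The sketch below ignores peepholes, projection layers, and normalization parameters, and the layer dimensions are hypothetical rather than taken from the paper.

```python
def lstm_layer_params(input_dim, hidden_dim, num_weight_blocks=4):
    """Rough parameter count: each block has input weights, recurrent weights, and a bias."""
    per_block = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return num_weight_blocks * per_block


def cifg_layer_params(input_dim, hidden_dim):
    """CIFG drops the separate input-gate block, leaving three weight blocks."""
    return lstm_layer_params(input_dim, hidden_dim, num_weight_blocks=3)


# Hypothetical layer size, for illustration only.
lstm_count = lstm_layer_params(input_dim=512, hidden_dim=1024)
cifg_count = cifg_layer_params(input_dim=512, hidden_dim=1024)
print(f"LSTM: {lstm_count:,} params, CIFG: {cifg_count:,} params, "
      f"ratio: {cifg_count / lstm_count:.2f}")
```

Under these simplifications a CIFG layer carries three quarters of the parameters of an LSTM layer of the same width, before any pruning is applied.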

Key findings

Sparsity	#Params (M)	% Baseline	VoiceSearch WER	YouTube WER	Telephony WER
0%	122.1	100%	6.6	19.5	8.1
50%	69.7	57%	6.7	20.3	8.2
70%	48.7	39.9%	7.1	20.6	8.5
80%	38.2	31.3%	7.4	21.2	8.9
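
The "% Baseline" column and the accuracy cost of each sparsity level can be recomputed directly from the table above. The short sketch below copies the parameter counts and VoiceSearch WER values from the table. The tuple layout and the `summarize` helper are illustrative only.

```python
# Rows copied from the table: (sparsity %, parameters in millions, VoiceSearch WER).
ROWS = [
    (0, 122.1, 6.6),
    (50, 69.7, 6.7),
    (70, 48.7, 7.1),
    (80, 38.2, 7.4),
]


def summarize(rows):
    """Return (sparsity, size as % of baseline, relative WER increase in %) per row."""
    _, base_params, base_wer = rows[0]
    summary = []
    for sparsity, params, wer in rows:
        size_pct = 100.0 * params / base_params
        rel_wer_increase_pct = 100.0 * (wer - base_wer) / base_wer
        summary.append((sparsity, size_pct, rel_wer_increase_pct))
    return summary


for sparsity, size_pct, rel_wer in summarize(ROWS):
    print(f"{sparsity}% sparse: {size_pct:.1f}% of baseline size, "
          f"{rel_wer:+.1f}% relative VoiceSearch WER")
```

The recomputed size ratios agree with the table to rounding. Model size does not fall in direct proportion to the sparsity level: at 50% sparsity the model still holds about 57% of the baseline parameters.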

Pruning yields substantial parameter reductions with modest WER impact across datasets (e.g., 50% sparsity yields 6.7/20.3/8.2 WER on VoiceSearch/YouTube/Telephony).
CIFG-LSTM in encoders and sparse SRU in decoders can reduce parameters by 59% with limited WER degradation (7.1/18.9/8.2 on VoiceSearch/YouTube/Telephony).
Quantization (hybrid and integer) preserves accuracy well; integer quantization brings runtime down to about 30% of that of float models on Pixel 3 small cores.
A model combining 50% sparse CIFG (encoder) and 30% sparse SRU (decoder) outperforms a small dense LSTM baseline in size and maintains competitive WER.
SRU can substitute for LSTM in the decoder but is less effective in the encoder; CIFG-LSTM offers favorable trade-offs.
Sparse CIFG with quantization can outperform the fully dense small baseline under certain conditions.
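
To make the quantization findings more concrete, here is a minimal sketch of 8-bit weight quantization. It is not the paper's scheme, whose details this review does not give. It assumes symmetric per-tensor quantization with a single float scale, and the function names are hypothetical. Storing weights as 8-bit integers instead of 32-bit floats cuts their storage by a factor of four, at the cost of a rounding error bounded by half the scale.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


def quantized_dot(quantized_weights, scale, activations):
    """Dot product with 8-bit weights and float activations, rescaled once at the end."""
    return scale * sum(q * a for q, a in zip(quantized_weights, activations))


# Illustrative weights, not taken from the paper.
weights = [0.42, -1.0, 0.03, 0.77]
q_weights, q_scale = quantize_symmetric_int8(weights)
restored = dequantize(q_weights, q_scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={q_scale:.6f}, worst round-trip error={worst_error:.6f}")
```

Full integer quantization would also quantize the activations so that the inner loop runs entirely in integer arithmetic; that step is omitted here.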

This review was created by AI and reviewed by human editors.