QUICK REVIEW

[Paper Review] Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Hagen Soltau, Hank Liao|arXiv (Cornell University)|Oct 31, 2016

Speech Recognition and Synthesis15 references57 citations

TL;DR

This paper presents a competitive, end-to-end large vocabulary speech recognition system using a deep bidirectional LSTM RNN with CTC loss to directly predict whole words from acoustic input, eliminating the need for pronunciation lexicons, language models, or decoding. Trained on 125,000 hours of semi-supervised YouTube captions, the model achieves a 13.4% word error rate on a challenging YouTube transcription task, outperforming a strong conventional context-dependent phone-based system.

ABSTRACT

We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.

Motivation & Objective

To develop a simplified, end-to-end speech recognition system that bypasses traditional components like pronunciation lexicons and language models.
To investigate whether direct word-level modeling with deep neural networks can achieve competitive performance on large vocabulary tasks.
To overcome data sparsity in word-level acoustic modeling through large-scale semi-supervised training on 125,000 hours of YouTube captions.
To evaluate the effectiveness of CTC loss in enabling end-to-end training of word-level models without explicit decoding.
To compare the performance of word-level models against strong baseline systems using sub-word units and language models.

Proposed method

The model uses a deep bidirectional LSTM RNN architecture with stacked forward and backward LSTM layers to capture long-range context in acoustic sequences.
The network is trained with Connectionist Temporal Classification (CTC) loss, which allows alignment-free sequence modeling by learning to predict word sequences from raw acoustic frames.
The output layer uses a softmax over a vocabulary of 100,000 words, including numeric entities, with a special blank token to handle variable-length alignments.
The CTC loss function is computed using the forward-backward algorithm over a lattice of all possible alignments between input frames and label sequences.
The model is trained on 125,000 hours of semi-supervised audio captions from public YouTube videos to mitigate data sparsity for word-level units.
Two variants are evaluated: a 'spoken word' model (output in spoken form) and a 'written word' model (output normalized to written form), with both trained end-to-end.

Figure 1: The word posterior probabilities as predicted by the NSR model at each time-frame (30 msec) for a segment of music video ‘Stressed Out’ by Twenty One Pilots. We only plot the word with highest posterior and the missing words from the correct transcription: ‘Sometimes a certain smell will t

Experimental results

Research questions

RQ1Can a deep bidirectional LSTM RNN with CTC loss effectively model large vocabulary speech recognition using whole words as acoustic units?
RQ2Does direct word-level modeling eliminate the need for pronunciation lexicons and language models in end-to-end systems?
RQ3Can sufficient training data compensate for data sparsity in word-level acoustic modeling to achieve competitive performance?
RQ4How does the performance of a CTC-based word model compare to a strong context-dependent phone-based system with language modeling and decoding?
RQ5To what extent does a language model improve the performance of a CTC word model, and how does this compare to its impact on traditional systems?

Key findings

The CTC word model achieves a 13.4% word error rate on a difficult YouTube video transcription task, outperforming a strong conventional context-dependent phone-based system with a 14.2% WER.
The CTC word model achieves 12.0% WER in the spoken domain without any language model or decoding, slightly outperforming a CD phone model with a 30M 5-gram language model.
Adding a language model improves the CTC spoken word model’s WER from 12.0% to 11.6%, indicating a smaller dependency on language modeling compared to traditional systems.
The CTC written word model achieves 13.4% WER when rescoring lattices with a language model, showing only a 0.5% improvement, suggesting the model is already highly robust.
The model demonstrates strong generalization, accurately transcribing music videos despite no training on such content, as shown in qualitative results.
The results confirm that large-scale semi-supervised data (125,000 hours) enables effective training of word-level models, making them viable alternatives to sub-word unit systems.

Better researchstarts right now

From paper design to paper writing, dramatically reduce your research time.

No credit card · Free plan available

This review was created by AI and reviewed by human editors.