QUICK REVIEW

[Paper Review] Increasing Deep Neural Network Acoustic Model Size for Large Vocabulary Continuous Speech Recognition

Andrew L. Maas, Awni Hannun | arXiv (Cornell University) | Jun 30, 2014
Speech Recognition and Synthesis | 11 references | 20 citations
TL;DR

This paper investigates scaling deep neural network (DNN) acoustic models for large-vocabulary continuous speech recognition using a distributed GPU setup. It finds that larger models yield substantial reductions in word error rate (WER), but on the 300-hour Switchboard corpus the training-set improvements do not carry over to large test-set gains. On the 2,000-hour Fisher corpus, scaling model size is beneficial, which makes it a direct path to better performance when training data is abundant.

ABSTRACT

Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Part of the promise of DNNs is their ability to represent increasingly complex functions as the number of DNN parameters increases. This paper investigates the performance of DNN-based hybrid speech recognition systems as DNN model size and training data increase. Using a distributed GPU architecture, we train DNN acoustic models roughly an order of magnitude larger than those typically found in speech recognition systems. DNNs of this scale achieve substantial reductions in final system word error rate despite training with a loss function not tightly coupled to system error rate. However, training word error rate improvements do not translate to large improvements in test set word error rate for systems trained on the 300 hour Switchboard conversational speech corpus. Scaling DNN acoustic model size does prove beneficial on the Fisher 2,000 hour conversational speech corpus. Our results show that with sufficient training data, increasing DNN model size is an effective, direct path to performance improvements. Moreover, even smaller DNNs benefit from a larger training corpus.

Index Terms: speech recognition, neural networks, acoustic modeling

Motivation & Objective

  • To investigate the impact of increasing DNN acoustic model size on speech recognition performance.
  • To evaluate whether training with a loss function that is not tightly coupled to system error rate still yields system improvements as model size grows.
  • To determine whether larger models improve performance on limited-data versus large-data corpora.
  • To assess the interplay between model size and training data scale in hybrid DNN-HMM systems.

Proposed method

  • Trained DNN acoustic models using a distributed GPU architecture to scale model size roughly an order of magnitude beyond typical speech recognition systems.
  • Used a standard DNN training objective (not directly optimized for word error rate) to evaluate generalization under increasing model capacity.
  • Compared performance across two corpora: the 300-hour Switchboard and the 2,000-hour Fisher conversational speech datasets.
  • Measured word error rate (WER) on test sets to evaluate system-level performance after scaling model size and training data.
  • Maintained a hybrid DNN-HMM architecture for speech recognition, focusing on acoustic model improvements.
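
The evaluation metric named above, word error rate, is conventionally the word-level edit distance between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the paper's scoring pipeline; the function name and the example sentences are illustrative assumptions, and real scoring tools also apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion: reference word missing
                curr[j - 1] + 1,                  # insertion: extra hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the paper): one substitution and one
# deletion against a six-word reference.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.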

Experimental results

Research questions

  • RQ1: Does increasing DNN acoustic model size lead to measurable reductions in word error rate in large-vocabulary continuous speech recognition?
  • RQ2: To what extent do WER improvements on the training set translate to test set performance gains?
  • RQ3: How does the effectiveness of model scaling depend on the size of the available training data?
  • RQ4: Can large DNNs achieve better performance even when trained with a loss function not directly tied to system-level error rate?

Key findings

  • Scaling DNN model size led to substantial reductions in word error rate despite using a loss function not directly optimized for WER.
  • On the 300-hour Switchboard corpus, training WER improvements did not translate into large test set WER gains, which suggests that the amount of training data limited the benefit of larger models.
  • On the 2,000-hour Fisher corpus, increasing model size proved beneficial, indicating that a larger corpus lets the added model capacity pay off.
  • Even smaller DNNs benefited from the larger training corpus, so more data helps regardless of model size.
  • The results indicate that increasing model size is an effective, direct path to performance improvement when sufficient training data is available.
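
To make "model size" concrete, the parameter count of a fully connected DNN can be computed from its layer widths. This is a minimal sketch assuming dense layers with bias terms; the input dimension, hidden widths, depth, and output size below are hypothetical placeholders, not the configurations used in the paper.

```python
def count_parameters(input_dim: int, hidden_sizes: list[int], output_dim: int) -> int:
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Hypothetical configurations: 5 hidden layers of increasing width.
for width in (1024, 2048, 4096):
    total = count_parameters(input_dim=400, hidden_sizes=[width] * 5, output_dim=8000)
    print(f"hidden width {width}: {total:,} parameters")
```

The hidden-to-hidden weights grow quadratically with layer width, so widening the hidden layers is a quick way to increase total model size.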

This review was created by AI and reviewed by human editors.