[Paper Review] Fully Convolutional Speech Recognition
The paper presents a fully convolutional, end-to-end speech recognition system that operates on raw waveforms with a learnable front-end and a convolutional language model, achieving state-of-the-art results among end-to-end systems on WSJ and Librispeech.
Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words. On Wall Street Journal, our model matches the current state-of-the-art. On Librispeech, we report state-of-the-art performance among end-to-end models, including Deep Speech 2 trained with 12 times more acoustic data and significantly more linguistic data.
Motivation & Objective
- Motivate replacing recurrent architectures with fully convolutional networks for end-to-end ASR.
- Demonstrate end-to-end training from raw waveform without hand-crafted features.
- Introduce a convolutional language model for decoding in ASR.
- Evaluate on large vocabulary datasets (WSJ and Librispeech) to establish state-of-the-art among end-to-end systems.
- Analyze learnable front-ends and their impact on performance, especially in noisy conditions.
Proposed method
- A learnable front-end that mimics pre-emphasis and computes feature-like representations from raw waveform.
- A deep convolutional acoustic model with gated linear units trained to predict letters using the Auto Segmentation Criterion (ASG).
- A convolutional language model (GCNN-14B) used to score transcriptions during beam search.
- Beam-search decoding integrating acoustic model scores with a convolutional LM and tuned hyperparameters for LM weight, word insertion reward, and silence penalty.
- Training and evaluation on WSJ (80 hours) and Librispeech (1000 hours), with dataset-specific language model training data and hyperparameter tuning.
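The decoding step listed above can be illustrated with a minimal sketch. It assumes the beam search ranks candidates by a log-linear combination of the acoustic score, the weighted language model log-probability, a per-word reward, and a per-silence penalty. The functional form, the sign conventions, the names `Hypothesis`, `decoder_score`, and `rerank`, and all numeric values are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str           # candidate transcription, words separated by spaces
    am_score: float     # acoustic model score (higher is better)
    lm_logprob: float   # language model log-probability of the text
    num_silences: int   # number of silence tokens in the alignment


def decoder_score(hyp, lm_weight, word_reward, silence_penalty):
    """Combine acoustic and LM scores with the three tuned hyperparameters."""
    num_words = len(hyp.text.split())
    return (hyp.am_score
            + lm_weight * hyp.lm_logprob       # LM weight scales the LM term
            + word_reward * num_words          # reward offsets the LM's bias toward short outputs
            - silence_penalty * hyp.num_silences)


def rerank(hyps, lm_weight=1.0, word_reward=0.5, silence_penalty=0.2):
    """Return hypotheses sorted from best to worst combined score."""
    return sorted(
        hyps,
        key=lambda h: decoder_score(h, lm_weight, word_reward, silence_penalty),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up scores: the second candidate is slightly better acoustically,
    # but the language model strongly prefers the first.
    candidates = [
        Hypothesis("the cat sat", am_score=-10.0, lm_logprob=-6.0, num_silences=1),
        Hypothesis("the cats at", am_score=-9.5, lm_logprob=-9.0, num_silences=1),
    ]
    print(rerank(candidates, lm_weight=1.0)[0].text)  # the cat sat
    print(rerank(candidates, lm_weight=0.0)[0].text)  # the cats at
```

With the LM weight set to zero the acoustically preferred candidate wins, which is why these hyperparameters are usually tuned on a validation set.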
Experimental results
Research questions
- RQ1: Can a fully convolutional architecture match or exceed recurrent architectures for acoustic and language modeling in end-to-end ASR?
- RQ2: Is learning the front-end from raw waveform advantageous over traditional mel-filterbank features, especially in noisy conditions?
- RQ3: Does integrating a convolutional language model improve decoding performance compared to traditional n-gram LMs?
- RQ4: What are the effects of varying the learnable front-end filter count and LM context on WER across WSJ and Librispeech?
- RQ5: How does end-to-end CNN-based ASR perform relative to state-of-the-art systems on WSJ and Librispeech?
Key findings
- The fully convolutional model matches the current state of the art among end-to-end systems on WSJ.
- On Librispeech, it achieves state-of-the-art performance among end-to-end models, a comparison that includes Deep Speech 2 (trained with 12 times more acoustic data), with an absolute WER reduction of 2% on the noisy test set and about 0.5% on clean speech.
- A convolutional language model yields systematic improvements over a 4-gram LM, with better perplexity and larger receptive field.
- Learning the front-end from raw waveform improves performance, notably in noisy data, and increasing the number of learnable filters yields further gains (e.g., 1.5% absolute WER reduction on Librispeech noisy test set).
- The learned front-end filters follow a roughly mel-like layout but are biased toward lower frequencies, suggesting that the standard mel scale may be suboptimal for ASR.
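Since the findings above are stated as absolute WER reductions, a short sketch of how WER and the two kinds of reduction are commonly computed may help. It uses the standard word-level edit distance; the function names and the example sentences and numbers are hypothetical and are not results from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def absolute_reduction(baseline_wer, new_wer):
    """Difference in WER percentage points."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer, new_wer):
    """Reduction as a percentage of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Hypothetical WERs: a 2-point absolute drop from 10% is a 20% relative drop.
    print(absolute_reduction(10.0, 8.0), relative_reduction(10.0, 8.0))
```

The distinction matters when comparing systems: the same absolute reduction corresponds to a larger relative improvement when the baseline WER is already low.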
This review was created by AI and reviewed by human editors.