QUICK REVIEW

[Paper Review] Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve | arXiv (Cornell University) | Sep 11, 2016

Speech Recognition and Synthesis | 24 references | 248 citations

TL;DR

The paper presents an end-to-end ConvNet-based acoustic model trained with an AutoSegCriterion (ASG) for grapheme-based speech recognition, paired with a simple beam-search decoder, achieving competitive LibriSpeech results without force alignment or HMM/GMM pipelines.

ABSTRACT

This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.

Motivation & Objective

Eliminate the need for force-aligned phonetic transcriptions in ASR by training directly on graphemes.
Propose a simple end-to-end architecture using 1D ConvNets and a graph-based segmentation criterion.
Demonstrate competitive Word Error Rates on LibriSpeech with MFCC, power spectrum, and raw waveform inputs.
Show that ASG can match or exceed CTC in speed and accuracy on standard benchmarks.

Proposed method

Use 1D convolutional neural networks as the acoustic model to map input features (MFCC, power spectrum, or raw waveform) to letter scores.
Introduce the AutoSegCriterion (ASG), a graph-based segmentation criterion with un-normalized node scores and global normalization that avoids a blank label.
Train with ASG using an unfolded graph over time, optimizing a forward-score via logadd operations similar to CTC but without blanks.
Incorporate a simple one-pass beam-search decoder with language model integration (KenLM) and word insertion penalties.
Evaluate on LibriSpeech using 16 kHz audio, a 30-letter grapheme set (including apostrophe, silence, and repetition markers), and compare ASG to CTC.
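The ASG objective described in the list above can be illustrated with two forward recursions: one over alignments of the target transcription, one over all label sequences (the global normalization). What follows is a minimal pure-Python sketch, not the authors' implementation. The function names and the list-of-lists layout for emission and transition scores are assumptions made for illustration. It also assumes the target contains no identical adjacent labels (the situation the repetition markers are meant to guarantee) and that there are at least as many frames as target labels.

```python
import math


def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b)); handles -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def constrained_score(emissions, transitions, target):
    """Log-sum of path scores over all alignments of `target` to the frames.

    emissions[t][k]   : un-normalized score of label k at frame t
    transitions[i][j] : score of going from label i to label j (i == j is a self-loop)
    target            : list of label indices, no blank label
    """
    num_states = len(target)
    alpha = [-math.inf] * num_states
    alpha[0] = emissions[0][target[0]]  # every path starts on the first target label
    for frame in emissions[1:]:
        new_alpha = []
        for s, label in enumerate(target):
            # Either stay on the same target label or advance from the previous one.
            score = alpha[s] + transitions[label][label]
            if s > 0:
                score = logadd(score, alpha[s - 1] + transitions[target[s - 1]][label])
            new_alpha.append(score + frame[label])
        alpha = new_alpha
    return alpha[-1]  # every path must end on the last target label


def full_score(emissions, transitions):
    """Log-sum of path scores over every possible label sequence."""
    num_labels = len(emissions[0])
    alpha = list(emissions[0])
    for frame in emissions[1:]:
        new_alpha = []
        for j in range(num_labels):
            total = -math.inf
            for i in range(num_labels):
                total = logadd(total, alpha[i] + transitions[i][j])
            new_alpha.append(total + frame[j])
        alpha = new_alpha
    total = -math.inf
    for value in alpha:
        total = logadd(total, value)
    return total


def asg_loss(emissions, transitions, target):
    """Globally normalized negative log-score of the target (non-negative)."""
    return full_score(emissions, transitions) - constrained_score(
        emissions, transitions, target
    )


# Toy example: 3 frames, 2 labels, all scores zero.
emissions = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
transitions = [[0.0, 0.0], [0.0, 0.0]]
print(asg_loss(emissions, transitions, [0, 1]))  # log(4), about 1.386
```

In the toy example, 2 of the 8 possible label paths (0,0,1 and 0,1,1) are valid alignments of the target, so the loss is log(8) - log(2) = log(4). Training would lower this value by raising the scores of valid alignments relative to all other paths.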

Experimental results

Research questions

RQ1: Can end-to-end grapheme-based acoustic models without force alignment achieve competitive WER on LibriSpeech?
RQ2: Does the AutoSegCriterion provide equal or better performance and speed compared to CTC for sequence labeling without blanks?
RQ3: How do MFCC, power spectrum, and raw waveform inputs compare in end-to-end grapheme ASR under this architecture?
RQ4: What is the impact of data augmentation and training size on ASG performance?
RQ5: How well does the simple decoder with an external language model perform on standard benchmarks?

Key findings

ASG reaches a letter error rate (LER) comparable to CTC on the same data, and its CPU implementation can be faster on longer sequences.
On LibriSpeech, the best reported MFCC-based models reach around 6.9% LER on dev-clean and around 7.2% WER on test-clean.
Power spectrum and raw waveform inputs yield higher LER/WER than MFCCs but remain competitive, with improvements observed as data size increases.
Using data augmentation helps more with smaller training sets; with large data, MFCC and power-spectrum perform similarly.
The proposed end-to-end system operates without HMM/GMM force alignment and runs efficiently (e.g., decoding substantially faster than some baseline RNN-based systems).
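The LER and WER figures quoted above are both edit-distance ratios, computed over letters and words respectively. As a minimal sketch using the standard Levenshtein definition (the helper names are hypothetical, and the paper's exact scoring conventions, such as how silence or repetition labels are handled, are not reproduced here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, 1):
        cur = [i]
        for j, hyp_item in enumerate(hyp, 1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion of a reference item
                    cur[j - 1] + 1,  # insertion of a hypothesis item
                    prev[j - 1] + (ref_item != hyp_item),  # match or substitution
                )
            )
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref_text, hyp_text):
    """Word error rate: edit distance over whitespace-separated words."""
    return error_rate(ref_text.split(), hyp_text.split())


def ler(ref_text, hyp_text):
    """Letter error rate: edit distance over individual characters."""
    return error_rate(list(ref_text), list(hyp_text))


print(wer("the cat sat", "the cat sit"))  # one substituted word out of three
print(ler("cat", "cut"))  # one substituted letter out of three
```

Because both rates are normalized by the reference length, a hypothesis with many insertions can score above 100%.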

This review was created by AI and reviewed by human editors.