QUICK REVIEW

[Paper Review] End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Gabriel Synnaeve, Qiantong Xu | arXiv (Cornell University) | Nov 19, 2019

Speech Recognition and Synthesis | References: 54 | Citations: 166

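One-Line Summary

This paper studies end-to-end ASR trained with pseudo-labeling-based semi-supervised learning across ResNet, Time-Depth Separable ConvNet, and Transformer acoustic models. It reaches new state-of-the-art results using unlabeled LibriVox data and shows that reliance on external language models decreases when abundant unlabeled audio is available.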

ABSTRACT

We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.

Motivation and Objectives

Evaluate end-to-end ASR performance with CTC and Seq2Seq losses across diverse architectures.
Assess the impact of semi-supervised learning via pseudo-labeling on model performance when unlabeled data is available.
Characterize how unlabeled audio affects dependence on external language models during decoding.
Demonstrate state-of-the-art end-to-end results on LibriSpeech with and without external language models.
Provide insights into how different architectures benefit from semi-supervised data.

Proposed Method

Train multiple end-to-end acoustic models (ResNet, Time-Depth Separable ConvNets, and Transformer) with CTC or Seq2Seq losses.
Use LibriSpeech as labeled data and LibriVox as unlabeled data, generating pseudo-labels with a Transformer AM and decoding with a language model.
Construct and train language models (n-gram, GCNN, and Transformer) on LibriSpeech text with careful data filtering to avoid overlap with unlabeled audio.
Perform one-pass beam-search decoding with external LMs and optional rescoring to obtain final transcriptions.
Evaluate on LibriSpeech dev/test sets, reporting WER with and without decoding/LM, and with pseudo-labeled data.
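
The pseudo-labeling step in the list above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `transcribe` stands in for "bootstrap acoustic model plus LM decoding", utterances are represented by file-name strings, and the empty-hypothesis filter is a hypothetical placeholder for whatever filtering a real system would apply.

```python
from typing import Callable, List, Tuple

Utterance = str                    # assumption: an audio clip identified by its path
Labeled = Tuple[Utterance, str]    # (audio, transcript)


def pseudo_label(
    unlabeled: List[Utterance],
    transcribe: Callable[[Utterance], str],
) -> List[Labeled]:
    """Attach a machine-generated transcript to each unlabeled utterance."""
    pairs: List[Labeled] = []
    for utt in unlabeled:
        hyp = transcribe(utt).strip()
        if hyp:  # hypothetical filter: drop clips with an empty hypothesis
            pairs.append((utt, hyp))
    return pairs


def build_training_set(
    labeled: List[Labeled],
    unlabeled: List[Utterance],
    transcribe: Callable[[Utterance], str],
) -> List[Labeled]:
    """Combine human-labeled data with pseudo-labeled data for retraining."""
    return list(labeled) + pseudo_label(unlabeled, transcribe)


def toy_transcribe(utt: Utterance) -> str:
    """Toy stand-in for a trained acoustic model decoded with an LM."""
    return {"clip_a.flac": "hello world", "clip_b.flac": "  "}.get(utt, "")


train = build_training_set(
    labeled=[("ls_1.flac", "good morning")],
    unlabeled=["clip_a.flac", "clip_b.flac"],
    transcribe=toy_transcribe,
)
print(train)
```

In this toy run, `clip_b.flac` is dropped because its hypothesis is blank, so the combined set holds one human-labeled and one pseudo-labeled pair. A new acoustic model would then be trained on the combined set.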

Experimental Results

Research Questions

RQ1: Can pseudo-labeling on large unlabeled datasets bridge the performance gap between different end-to-end architectures (ResNet, TDS, Transformer) and losses (CTC, Seq2Seq) in ASR?
RQ2: How does increasing unlabeled data (LibriVox) affect WER across architectures and the reliance on external language models?
RQ3: What is the relative contribution of decoding and LM rescoring in semi-supervised settings for end-to-end ASR?
RQ4: Can end-to-end models trained with semi-supervised data achieve state-of-the-art results on LibriSpeech without external LMs, and how do results compare when LMs are used?
RQ5: What are the effects of pseudo-labeling depending on whether the language model used for labeling overlaps with the unlabeled corpus?

Key Findings

AM / Model	LM type	Dev-clean WER	Dev-other WER	Test-clean WER	Test-other WER
Transformer (LibriVox, no decoding)	-	-	-	2.28%	4.88%
Transformer (LibriVox, decoding + rescoring)	4-gram LM + GCNN/Transformer rescoring	-	-	2.09%	4.11%
Transformer (LibriSpeech only, decoding + rescoring)	LM trained on LibriSpeech	-	-	-	5.17%
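
For readers unfamiliar with the metric in the table, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function below is a minimal single-utterance sketch of that standard definition; the name `word_error_rate` and the whitespace tokenization are assumptions for illustration and not taken from the paper's tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Corpus-level WER, as reported on benchmark test sets, is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance rates.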

Semi-supervised pseudo-labeling improves all architectures (ResNet, TDS, Transformer) across CTC and Seq2Seq losses.
With LibriVox unlabeled data, Transformer models reach 2.28% WER on test-clean and 4.88% on test-other without decoding or an LM; decoding with an LM and rescoring reduces these to 2.09% and 4.11%, respectively.
Trained on LibriSpeech alone, the end-to-end Transformer model achieves 5.17% WER on test-other with decoding and rescoring, versus 6.98% without decoding.
Models trained with LibriVox pseudo-labels rely less on external LMs during decoding, as shown by the smaller gains from LM decoding and rescoring when unlabeled data is plentiful.
Training on LibriVox pseudo-labels alone yields competitive results, e.g., 2.38% dev-clean and 5.43% dev-other with a Transformer AM trained on LibriVox pseudo-labels only (compared with the LibriSpeech baseline of 2.99% / 7.31%).
Increasing the amount of pseudo-labeled audio consistently improves WER; the full LibriVox-augmented setup yields the best reported results.
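
The claim about reduced LM reliance can be checked with simple arithmetic on the test-other figures quoted in the findings above. The sketch below computes the relative WER reduction obtained from LM decoding and rescoring in each setting; the helper name `relative_wer_reduction` is an assumption for illustration, and the comparison is indicative only because the two settings use different trained models.

```python
def relative_wer_reduction(wer_without_lm: float, wer_with_lm: float) -> float:
    """Fraction of the no-LM error removed by LM decoding and rescoring."""
    return (wer_without_lm - wer_with_lm) / wer_without_lm


# test-other WER (%) as quoted in this review
supervised = relative_wer_reduction(6.98, 5.17)        # LibriSpeech only
semi_supervised = relative_wer_reduction(4.88, 4.11)   # with LibriVox pseudo-labels

print(f"LibriSpeech only: {supervised:.1%}")       # 25.9%
print(f"With LibriVox:    {semi_supervised:.1%}")  # 15.8%
```

On these numbers, the external LM removes roughly a quarter of the errors for the supervised model but only about a sixth for the semi-supervised one, which is consistent with the finding that models trained on more audio depend less on the LM.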


This review was generated by AI and checked by a human editor.