QUICK REVIEW

[Paper Review] DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Shaoshi Ling, Yuzong Liu|arXiv (Cornell University)|Dec 11, 2020

Speech Recognition and Synthesis | References: 34 | Cited by: 58

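One-Sentence Summary

DeCoAR 2.0 uses a Transformer encoder and a vector quantization layer with a diversity objective to learn deep contextualized acoustic representations for semi-supervised speech recognition, achieving WER competitive with baselines even when labeled data is limited.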

ABSTRACT

Recent success in speech representation learning enables a new way to leverage unlabeled data to train speech recognition model. In speech representation learning, a large amount of unlabeled data is used in a self-supervised manner to learn a feature representation. Then a smaller amount of labeled data is used to train a downstream ASR system using the new feature representations. Based on our previous work DeCoAR and inspirations from other speech representation learning, we propose DeCoAR 2.0, a Deep Contextualized Acoustic Representation with vector quantization. We introduce several modifications over the DeCoAR: first, we use Transformers in encoding module instead of LSTMs; second, we introduce a vector quantization layer between encoder and reconstruction modules; third, we propose an objective that combines the reconstructive loss with vector quantization diversity loss to train speech representations. Our experiments show consistent improvements over other speech representations in different data-sparse scenarios. Without fine-tuning, a light-weight ASR model trained on 10 hours of LibriSpeech labeled data with DeCoAR 2.0 features outperforms the model trained on the full 960-hour dataset with filterbank features.

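Motivation and Objectives

Leverage unlabeled speech data to learn robust acoustic representations for ASR.
Improve representation quality by replacing LSTMs with Transformers and adding vector quantization.
Train discrete speech representations by combining a reconstruction loss with a diversity objective.
Demonstrate effectiveness in data-sparse semi-supervised ASR settings.
Analyze the impact of the VQ module on downstream ASR performance.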

Proposed Method

Encoder: 1D conv layer followed by Transformer blocks to produce latent z representations (masked frame strategy).
Quantization: multiple codebooks with Gumbel-Softmax and straight-through estimator map z to a quantized v using discrete codewords.
Reconstruction: a feed-forward network reconstructs original frames from quantized representations with L1 loss.
Diversity loss: encourages uniform usage of codebook entries to promote informative linguistic units.
Joint objective: L = L_recon + alpha * L_div to train the model.
Semi-supervised downstream: freeze encoder after pretraining; attach to downstream ASR model with no encoder fine-tuning; use CTC loss for ASR.
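
The quantization, diversity, and joint-objective steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature `tau`, and the default `alpha=0.1` are hypothetical, and the diversity term uses the entropy-based form popularized by wav2vec 2.0, which the summary above does not spell out. The straight-through estimator needs automatic differentiation, so only the forward pass (soft probabilities plus the hard codeword index) is shown.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=random):
    # Add Gumbel(0, 1) noise, -log(-log(U)), then apply a temperature softmax.
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # keep U away from 0 so the log is defined
        noisy.append((logit - math.log(-math.log(u))) / tau)
    soft = softmax(noisy)
    # Hard codeword choice used in the forward pass; with autograd, gradients
    # would flow through `soft` (straight-through estimator, not shown here).
    hard = max(range(len(soft)), key=soft.__getitem__)
    return soft, hard


def diversity_loss(probs):
    # probs[g][t][v]: softmax probability of codeword v in codebook g at frame t.
    # Assumed form: negative entropy of the average codeword usage, averaged
    # over codebooks and scaled by the codebook size. Lower is more uniform.
    num_groups = len(probs)
    num_entries = len(probs[0][0])
    total = 0.0
    for group in probs:
        usage = [sum(frame[v] for frame in group) / len(group)
                 for v in range(num_entries)]
        total += sum(p * math.log(p) for p in usage if p > 0.0)
    return total / (num_groups * num_entries)


def l1_loss(predicted, target):
    # Mean absolute error between reconstructed and original frames.
    diffs = [abs(p - t)
             for pred_frame, tgt_frame in zip(predicted, target)
             for p, t in zip(pred_frame, tgt_frame)]
    return sum(diffs) / len(diffs)


def joint_loss(predicted, target, probs, alpha=0.1):
    # L = L_recon + alpha * L_div; the value of alpha here is a placeholder.
    return l1_loss(predicted, target) + alpha * diversity_loss(probs)


# One codebook with 4 entries over 3 frames: uniform vs. collapsed usage.
uniform = [[[0.25] * 4 for _ in range(3)]]
collapsed = [[[1.0, 0.0, 0.0, 0.0] for _ in range(3)]]
print(diversity_loss(uniform) < diversity_loss(collapsed))
```

In this sketch, uniform codeword usage gives a lower diversity loss than a collapsed codebook, so minimizing the joint objective pushes the model toward using all codebook entries.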

Experimental Results

Research Questions

RQ1: Can Transformer-based encoders with vector quantization produce robust, contextualized acoustic representations from unlabeled data?
RQ2: Does combining reconstruction loss with a diversity loss improve ASR performance in low-resource labeled-data regimes?
RQ3: How does DeCoAR 2.0 compare to other representation learning approaches (e.g., wav2vec 2.0, VQ-APC) in semi-supervised LibriSpeech settings?
RQ4: What is the impact of the VQ layer on downstream ASR accuracy in data-sparse scenarios?

Key Findings

DeCoAR 2.0 with 10 hours of labeled data matches or surpasses systems trained on 960 hours with filterbank features in some conditions.
With 10 hours of labeled data, DeCoAR 2.0 achieves WER of 5.43% (test-clean) and 13.27% (test-other).
With 1 hour of labeled data, DeCoAR 2.0 achieves WER of 13.75% (test-clean) and 29.13% (test-other).
Ablation shows that the VQ layer improves ASR performance in the LibriSpeech 10-hour semi-supervised setting (WER on test-clean/test-other: 6.29/18.54 without VQ vs. 5.43/13.27 with VQ).
DeCoAR 2.0 performs comparably to wav2vec 2.0 across semi-supervised scenarios.
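
To put the ablation figures in perspective, the relative WER reduction from adding the VQ layer can be worked out directly from the numbers quoted above. The helper name `relative_wer_reduction` is hypothetical and introduced only for this illustration.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    # Relative reduction in percent: how much of the baseline error is removed.
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Ablation WERs from the 10-hour setting: (test-clean, test-other).
without_vq = (6.29, 18.54)
with_vq = (5.43, 13.27)

for name, base, new in zip(("test-clean", "test-other"), without_vq, with_vq):
    print(f"{name}: {relative_wer_reduction(base, new):.1f}% relative reduction")
# prints:
# test-clean: 13.7% relative reduction
# test-other: 28.4% relative reduction
```

The gain is larger on test-other, the noisier of the two evaluation sets.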

This review was generated by AI and checked by a human editor.