QUICK REVIEW

[Paper Review] Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Ching-Feng Yeh, Jay Mahadeokar|arXiv (Cornell University)|Oct 28, 2019

Speech Recognition and Synthesis | References: 22 | Citations: 66

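One-Sentence Summary

This paper replaces the LSTM encoder in neural transducers with a Transformer-based encoder (VGG-Transformer) and enables streaming inference through causal convolution and truncated self-attention. The resulting compact model achieves competitive WER on LibriSpeech.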

ABSTRACT

We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and comes with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37 % on the test-clean set and 15.30 % on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence length.

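Motivation and Objectives

Motivate the use of Transformer networks in neural transducers for end-to-end ASR.
Incorporate positional information and reduce the frame rate using causal convolution and VGGNet blocks.
Enable streaming inference with a truncated self-attention mechanism while keeping computational cost under control.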
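
The causal-convolution idea in the second objective can be sketched as follows. This is a minimal illustration, not the paper's VGGNet front end: it works on scalar frames, and the `stride` argument is an assumed stand-in for whatever downsampling the VGGNet blocks actually use. The function name `causal_conv1d` is hypothetical.

```python
def causal_conv1d(frames, kernel, stride=1):
    """Causal 1-D convolution over a sequence of scalar frames.

    kernel[-1] weights the current frame, kernel[0] the oldest one.
    Output i is computed at input time t = i * stride and never
    reads a frame with index greater than t.
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so the window never reaches into the future.
    padded = [0.0] * (k - 1) + list(frames)
    out = []
    for t in range(0, len(frames), stride):
        window = padded[t:t + k]  # input frames t-k+1 .. t
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# stride=2 halves the frame rate (assumed downsampling, for illustration only)
reduced = causal_conv1d(frames, [0.5, 0.5], stride=2)
```

Because only past frames are padded in, an output can be emitted as soon as its current input frame arrives, which is the property streaming inference needs.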

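Proposed Method

Adopt a Transformer encoder with multi-head self-attention as the encoder in the neural transducer framework.
Inject positional information and reduce the frame rate using causal convolution (VGGNet style).
Apply truncated self-attention, which restricts context to a fixed window, for streaming and O(T) complexity.
Experiment with different encoder/predictor combinations under a fixed parameter budget.
Evaluate on LibriSpeech with 80-dim log-MEL features and SpecAugment, reporting WER on test-clean and test-other.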
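
To make the truncated self-attention step concrete, here is a minimal sketch of the attention mask it implies, assuming frame t may attend only to frames in the window [t - L, t + R]. The helper names `truncated_attention_mask` and `attended_pairs` are hypothetical and are not taken from the paper's code.

```python
def truncated_attention_mask(num_frames, left, right):
    """mask[t][s] is True when frame t may attend to frame s,
    i.e. when t - left <= s <= t + right (clipped to the sequence)."""
    return [
        [t - left <= s <= t + right for s in range(num_frames)]
        for t in range(num_frames)
    ]


def attended_pairs(mask):
    """Number of (query, key) pairs that survive the mask."""
    return sum(sum(row) for row in mask)


T, L, R = 100, 32, 4
windowed = attended_pairs(truncated_attention_mask(T, L, R))
full = attended_pairs(truncated_attention_mask(T, T, T))
# windowed is bounded by T * (L + R + 1), which grows linearly in T;
# full equals T * T, which grows quadratically.
```

With L and R fixed, the number of attended pairs per layer is at most T * (L + R + 1), which is where the O(T) complexity claim comes from.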

Experimental Results

Research Questions

RQ1: Can Transformer-based encoders outperform LSTMs in neural transducers for ASR?
RQ2: Does causal convolution plus VGGNet improve positional encoding and efficiency in Transformer-based transducers?
RQ3: How does truncated self-attention affect streaming capability and WER trade-offs?
RQ4: What is the optimal left/right context balance (L, R) for streaming performance without excessive latency?
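
For RQ4, it helps to see how a per-layer right context turns into latency. The sketch below assumes every encoder layer uses the same right context R and that look-ahead adds up linearly across stacked layers. The frame duration is a placeholder, since this review does not state the encoder frame rate.

```python
def total_lookahead_frames(num_layers, right_context):
    """Each layer lets a frame see `right_context` frames ahead of it,
    so stacking layers adds the look-ahead up linearly."""
    return num_layers * right_context


def lookahead_seconds(num_layers, right_context, frame_seconds):
    """Look-ahead converted to seconds for a given encoder frame duration."""
    return total_lookahead_frames(num_layers, right_context) * frame_seconds


# 12 encoder layers with R = 4, as in the streaming configurations reviewed here
frames_ahead = total_lookahead_frames(12, 4)

# frame_seconds=0.04 is a hypothetical value chosen only for illustration
latency = lookahead_seconds(12, 4, frame_seconds=0.04)
```

Under these assumptions, even a small per-layer R produces a sizeable total look-ahead in a deep encoder, which is why R is kept much smaller than L.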

Key Findings

Model architecture	Left context L	Right context R	test-clean WER (%)	test-other WER (%)
Neural Transducer (encoder: LSTM 5x1024; predictor: LSTM 2x700)	inf	0	12.31	23.16
Neural Transducer (encoder: BLSTM 4x640; predictor: LSTM 2x700)	inf	inf	6.85	16.90
Neural Transducer (encoder: Transformer 12x; predictor: LSTM 2x700)	inf	inf	6.08	13.89
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	inf	0	12.32	23.08
Neural Transducer (encoder: Transformer 12x; predictor: LSTM 2x700)	inf	4	6.99	16.88
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	8	4	6.47	15.79
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	16	4	6.57	15.92
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	32	4	6.37	15.30

With full context, the Transformer 12x encoder with an LSTM 2x700 predictor reaches 6.08%/13.89% WER (test-clean/test-other) with 45.7M parameters.
The VGG-Transformer encoder with unlimited self-attention outperforms the BLSTM baseline (6.85%/16.90%) but is not streamable.
With truncated self-attention at (L, R) = (16, 4) or (32, 4), the model reaches 6.57%/15.92% and 6.37%/15.30% WER respectively, while enabling streaming and keeping O(T) complexity.
Best trade-off found: (L, R) = (32, 4) yields test-clean 6.37% and test-other 15.30% WER under streaming constraints.
Overall, Transformer-Transducer achieves 6.37%/15.30% WER on LibriSpeech with 45.7M parameters and linear-time inference.
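
The gaps in the table are easier to compare as relative WER reductions. The snippet below is a small worked calculation using only WER values from the table above. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# (test-clean, test-other) WER values copied from the table
blstm = (6.85, 16.90)
transformer_full = (6.08, 13.89)     # unlimited context
transformer_stream = (6.37, 15.30)   # (L, R) = (32, 4)

for name, system in [("full context", transformer_full),
                     ("streaming (32, 4)", transformer_stream)]:
    clean = relative_reduction(blstm[0], system[0])
    other = relative_reduction(blstm[1], system[1])
    print(f"{name}: {clean:.2f}% on test-clean, {other:.2f}% on test-other")
```

Against the BLSTM baseline, this works out to roughly 11% and 18% relative reduction with full context, and roughly 7% and 9% for the streaming (32, 4) configuration.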

This review was generated by AI and checked by a human editor.