QUICK REVIEW

[Paper Review] Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Ching-Feng Yeh, Jay Mahadeokar|arXiv (Cornell University)|Oct 28, 2019

Speech Recognition and Synthesis | References: 22 | Cited by: 66

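One-Sentence Summary

The paper replaces the LSTM encoder of a neural transducer with a Transformer-based encoder (VGG-Transformer), using causal convolution and truncated self-attention to enable streaming end-to-end speech recognition, and achieves competitive WER on LibriSpeech with a compact model.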

ABSTRACT

We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence length.

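Motivation and Goals

Explore the use of Transformer networks in a neural transducer for end-to-end ASR.
Incorporate positional information and reduce the frame rate with causal convolution and VGGNet blocks.
Enable streaming inference with truncated self-attention while keeping computational complexity under control.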
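
To make the causal-convolution idea concrete, here is a minimal sketch in plain Python. It is a 1-D simplification and not the paper's VGGNet front end: the function name `causal_conv1d`, the scalar frames, and the use of `stride` as a stand-in for frame-rate reduction are all assumptions made for illustration.

```python
from typing import List


def causal_conv1d(frames: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Hypothetical 1-D causal convolution (cross-correlation form).

    The output at time t depends only on frames[t-k+1 .. t], where k is the
    kernel length, so no future frame is ever read. A stride greater than 1
    emits fewer outputs than inputs, which lowers the frame rate.
    """
    k = len(kernel)
    # Left-pad with k-1 zeros so the window ending at frame t is always full.
    padded = [0.0] * (k - 1) + list(frames)
    outputs = []
    for t in range(0, len(frames), stride):
        window = padded[t:t + k]  # original indices t-k+1 .. t
        outputs.append(sum(w * x for w, x in zip(kernel, window)))
    return outputs


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    print(causal_conv1d(signal, [1.0, 1.0, 1.0]))            # one output per frame
    print(causal_conv1d(signal, [1.0, 1.0, 1.0], stride=2))  # halved frame rate
```

Because the padding is applied only on the left, changing a later frame cannot alter any earlier output, which is the property a streaming encoder relies on.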

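Proposed Method

Use a Transformer encoder with multi-head self-attention as the encoder in the neural transducer framework.
Use causal convolution (VGGNet style) to incorporate positional information and reduce the frame rate.
Apply truncated self-attention, which restricts context to a fixed window, to enable streaming and obtain O(T) complexity.
Experiment with different encoder/predictor pairings under a fixed parameter budget.
Evaluate on LibriSpeech with 80-dimensional log-Mel features and SpecAugment, reporting WER on test-clean and test-other.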
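
The truncated self-attention step can be illustrated with a small mask-building sketch. It is an illustration under stated assumptions, not the authors' implementation: the helper names `truncated_attention_mask` and `attended_pairs` are hypothetical, and the mask is a plain list of booleans rather than a tensor.

```python
from typing import List


def truncated_attention_mask(T: int, left: int, right: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s.

    Frame t sees at most `left` past frames, itself, and `right` future
    frames, so each row has at most left + right + 1 allowed positions.
    """
    return [[t - left <= s <= t + right for s in range(T)] for t in range(T)]


def attended_pairs(T: int, left: int, right: int) -> int:
    """Count the (query, key) pairs that attention would actually compute."""
    return sum(sum(row) for row in truncated_attention_mask(T, left, right))


if __name__ == "__main__":
    T, left, right = 100, 32, 4
    truncated = attended_pairs(T, left, right)
    full = T * T  # unrestricted self-attention touches every pair
    print(f"truncated pairs: {truncated}, full pairs: {full}")
    # The truncated count is bounded by T * (left + right + 1), linear in T.
    assert truncated <= T * (left + right + 1)
```

With the window sizes fixed, the number of attended pairs grows linearly with T instead of quadratically, which is where the O(T) complexity comes from. The right context `right` is the only lookahead, so it bounds how many future frames the encoder must wait for.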

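Experimental Results

Research Questions

RQ1: Can a Transformer-based encoder outperform LSTM in a neural transducer for ASR?
RQ2: Does causal convolution with VGGNet improve positional encoding and efficiency in a Transformer-based transducer?
RQ3: How does truncated self-attention affect the trade-off between streaming capability and WER?
RQ4: What is the best balance of left/right context (L, R) for streaming performance without excessive latency?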

Key Findings

Model architecture	Left context L	Right context R	test-clean WER (%)	test-other WER (%)
Neural Transducer (encoder: LSTM 5x1024; predictor: LSTM 2x700)	inf	0	12.31	23.16
Neural Transducer (encoder: BLSTM 4x640; predictor: LSTM 2x700)	inf	inf	6.85	16.90
Neural Transducer (encoder: Transformer 12x; predictor: LSTM 2x700)	inf	inf	6.08	13.89
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	inf	0	12.32	23.08
Neural Transducer (encoder: Transformer 12x; predictor: LSTM 2x700)	inf	4	6.99	16.88
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	8	4	6.47	15.79
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	16	4	6.57	15.92
Neural Transducer (encoder: Transformer 12x; predictor: Transformer 6x)	32	4	6.37	15.30

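The Transformer 12x encoder paired with the LSTM 2x700 predictor achieves competitive WER with fewer parameters.
The VGG-Transformer encoder outperforms the BLSTM baseline when self-attention is unrestricted, but it is then not streamable.
With truncated self-attention at (L, R) = (16, 4) or (32, 4), the model is streamable with O(T) complexity while retaining strong WER.
The best trade-off is (L, R) = (32, 4), giving 6.37% WER on test-clean and 15.30% on test-other under the streaming constraint.
Overall, the Transformer-Transducer reaches 6.37%/15.30% WER on LibriSpeech with 45.7M parameters and inference time linear in the input length.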
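
To put the table's numbers in perspective, relative WER reduction against the BLSTM baseline can be computed as below. The WER values are copied from the table above; the function name `relative_wer_reduction` and the choice of BLSTM as the baseline are assumptions made for this illustration.

```python
def relative_wer_reduction(baseline: float, candidate: float) -> float:
    """Relative WER reduction in percent; positive means candidate is better."""
    return 100.0 * (baseline - candidate) / baseline


# (test-clean WER, test-other WER), taken from the results table.
BLSTM = (6.85, 16.90)
TRANSFORMER_FULL = (6.08, 13.89)      # unrestricted attention, not streamable
TRANSFORMER_STREAMING = (6.37, 15.30)  # truncated attention, (L, R) = (32, 4)

if __name__ == "__main__":
    for name, wers in [("full attention", TRANSFORMER_FULL),
                       ("streaming (32, 4)", TRANSFORMER_STREAMING)]:
        clean = relative_wer_reduction(BLSTM[0], wers[0])
        other = relative_wer_reduction(BLSTM[1], wers[1])
        print(f"{name}: test-clean {clean:.1f}%, test-other {other:.1f}% vs BLSTM")
```

Comparing the two rows shows how much of the gain over BLSTM is kept once attention is truncated for streaming.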

This review was generated by AI and reviewed by human editors.