QUICK REVIEW

[Paper Digest] Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Qian Zhang, Lu Han|arXiv (Cornell University)|Feb 7, 2020

Speech Recognition and Synthesis|References 24|Cited by 27

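One-Sentence Summary

This paper proposes Transformer Transducer, a streamable end-to-end speech recognition model that replaces the RNN encoders of the RNN-T architecture with self-attention-based Transformer encoders, giving faster training and competitive accuracy. The full-attention version reaches a state-of-the-art WER of 2.4% on LibriSpeech test-clean, while restricting self-attention to a limited left context (e.g., 10 frames) and a modest right context (e.g., 2 frames) makes the model streamable with a good balance between latency and accuracy.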

ABSTRACT

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss well-suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.

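Research Motivation and Objectives

Develop a streamable end-to-end speech recognition model that supports real-time inference while maintaining high accuracy.
Replace the RNN-based encoders in the RNN-T framework with Transformer encoders, exploiting parallelizable self-attention to speed up training.
Keep streaming computation tractable by limiting the self-attention receptive field to a fixed number of past and future frames.
Study the trade-off between recognition accuracy and inference latency in Transformer-based streaming ASR systems.
Show that, under suitable constraints, self-attention in both the audio and label encoders can reach state-of-the-art performance.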

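Proposed Method

Replace the RNN encoders of the RNN-T model with Transformer encoders that process the audio and label sequences independently using multi-head self-attention.
Mask self-attention in the audio encoder so that each frame attends only to the current frame and a bounded window of past frames; with a fixed window size, the per-frame inference cost is constant.
Use a fixed left context (e.g., 10 frames), optionally with a limited right context (e.g., 2 frames), to balance latency against accuracy.
Train the model with the standard RNN-T loss, which marginalizes over all possible alignments between the acoustic frames and the label sequence.
Combine the outputs of the audio and label encoders with a feed-forward network to predict the next label for every combination of acoustic frame position and label history.
Share a single mask across all Transformer layers for efficiency; using a different mask per layer is left as a possible extension.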
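
The limited-context masking described in the list above can be illustrated with a minimal sketch. It uses only the Python standard library, and the names `limited_context_mask` and `attended_counts` are hypothetical helpers chosen for this illustration, not code from the paper. The sketch builds one boolean mask in which frame t may attend to frames t - left through t + right, and shows that the number of attended frames never exceeds left + right + 1, which is why the per-frame cost stays bounded.

```python
def limited_context_mask(num_frames, left, right):
    """Build mask[t][s]: True if frame t may attend to frame s.

    Frame t sees at most `left` past frames, itself, and `right` future frames.
    """
    mask = []
    for t in range(num_frames):
        lo = max(0, t - left)                 # earliest visible frame
        hi = min(num_frames - 1, t + right)   # latest visible frame
        mask.append([lo <= s <= hi for s in range(num_frames)])
    return mask


def attended_counts(mask):
    """Number of frames each position attends to."""
    return [sum(row) for row in mask]


# Toy example: 6 frames, 2 frames of left context, 1 frame of right context.
toy_mask = limited_context_mask(num_frames=6, left=2, right=1)
for row in toy_mask:
    print("".join("x" if allowed else "." for allowed in row))
print(attended_counts(toy_mask))  # [2, 3, 4, 4, 4, 3]
```

With the 10-frame left context and 2-frame right context mentioned above, each frame would attend to at most 13 frames regardless of the utterance length.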

Fig. 1: RNN/Transformer Transducer architecture.
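
To make the marginalization performed by the RNN-T loss concrete, the following is a minimal sketch of the standard forward recursion over the frame-by-label lattice. It is not the authors' implementation. The inputs `log_blank[t][u]` and `log_label[t][u]` are assumed to be the log-probabilities that the joint network assigns to the blank symbol and to the next correct label at lattice node (t, u); in this sketch they are supplied directly by a hypothetical helper, `uniform_lattice`, rather than computed from encoder outputs.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_log_likelihood(log_blank, log_label):
    """Log-probability of the label sequence, summed over all alignments.

    log_blank[t][u]: log-prob of emitting blank at node (t, u), moving to t + 1.
    log_label[t][u]: log-prob of emitting label u + 1 at node (t, u), moving to u + 1.
    Both have shape T x (U + 1); log_label[t][U] is never read.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t - 1, u)
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # arrive by emitting a label at (t, u - 1)
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # A final blank at the last node terminates the sequence.
    return alpha[T - 1][U] + log_blank[T - 1][U]


def uniform_lattice(T, U, prob=0.5):
    """Hypothetical toy input: every node assigns the same probability."""
    return [[math.log(prob)] * (U + 1) for _ in range(T)]


# Toy check: T = 2 frames, U = 1 label, all probabilities 0.5.
# Two alignments exist, each with three emissions, so the total is 2 * 0.5**3.
log_p = rnnt_log_likelihood(uniform_lattice(2, 1), uniform_lattice(2, 1))
print(math.exp(log_p))
```

The training loss is the negative of this log-likelihood; practical implementations compute the same recursion in batched, vectorized form.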

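Experimental Results

Research Questions

RQ1: Can Transformer encoders be used effectively in an RNN-T-based streaming ASR model while keeping latency low?
RQ2: How does limiting the self-attention receptive field in the audio and label encoders affect recognition accuracy and inference speed?
RQ3: Can a Transformer-based model reach state-of-the-art performance on LibriSpeech while remaining suitable for streaming inference?
RQ4: How large is the performance gap between the full-attention Transformer Transducer and the context-limited streaming version, and can it be closed?
RQ5: How does the number of past and future frames used in attention affect the trade-off between latency and accuracy?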

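Key Findings

The full-attention Transformer Transducer sets a new state of the art on LibriSpeech, with a WER of 2.4% on test-clean and 5.6% on test-other, outperforming existing models.
Limiting the audio encoder to 10 frames of left context makes the per-frame inference cost constant, which makes streaming practical; the reported WER increase on test-clean relative to the full-attention model is only 1.2%.
Adding 2 frames of right context per layer lowers the WER from 4.2% to 3.6% on test-clean and from 11.3% to 10.0% on test-other, markedly narrowing the gap to the full-attention model.
Limiting the label encoder to only 3 previous label states performs comparably to using 20 states, indicating that label modeling needs very little left context.
Because self-attention is parallelizable, the model trains significantly faster than LSTM-based RNN-T models.
Attending to a limited number of future frames narrows the gap between the streaming model (10 frames of left context) and the full-attention model; 6 frames of right context reduces the WER gap by 16%.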
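
Because the right context is applied per layer, the lookahead of a stacked encoder grows with depth: each layer lets a frame see a few more future frames of the layer below. The short sketch below works through that arithmetic. The function names are hypothetical, and the layer count (18) and frame duration (30 ms) are illustrative assumptions that are not stated in this digest.

```python
def total_lookahead_frames(num_layers, right_context_per_layer):
    """Future frames visible to the top layer when every layer looks ahead."""
    return num_layers * right_context_per_layer


def lookahead_ms(num_layers, right_context_per_layer, frame_ms):
    """Lookahead latency in milliseconds for a given frame duration."""
    return total_lookahead_frames(num_layers, right_context_per_layer) * frame_ms


# Assumed values for illustration only: 18 layers and 30 ms per frame.
frames = total_lookahead_frames(num_layers=18, right_context_per_layer=2)
print(frames)                                                              # 36
print(lookahead_ms(num_layers=18, right_context_per_layer=2, frame_ms=30))  # 1080
```

Under these assumptions, even 2 frames of right context per layer adds roughly a second of lookahead, which is the latency side of the accuracy gains listed above.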

Fig. 2: Transformer encoder architecture.


This digest was generated by AI and reviewed by human editors.