QUICK REVIEW

[Paper Review] Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

Binbin Zhang, Di Wu|arXiv (Cornell University)|Dec 10, 2020

Speech Recognition and Synthesis|References: 26|Cited by: 46

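One-sentence summary

This paper proposes U2, a unified two-pass hybrid CTC/attention end-to-end model with dynamic chunk-based attention that supports both streaming and non-streaming speech recognition, with latency controlled by the chunk size and a notable CER reduction; attention rescoring improves both accuracy and decoding speed.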

ABSTRACT

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency could be easily controlled by only changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process causes very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that, the proposed method can unify the streaming and non-streaming model simply and efficiently. On the AISHELL-1 test set, our unified model achieves 5.60% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. The same model achieves 5.42% CER with 640ms latency in a streaming ASR system.

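Motivation and goals

Unify streaming and non-streaming end-to-end ASR in a single model.
Develop a two-pass CTC/attention architecture with dynamic chunk-based attention so that latency can be controlled.
Simplify training by using a combined CTC and AED loss, avoiding the more complex training tricks that RNN-T requires.
Show that the unified model reaches competitive streaming and non-streaming performance on AISHELL-1.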
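The combined CTC and AED objective mentioned above can be illustrated with a small sketch. It treats the attention-decoder loss as a single term and interpolates it with the CTC loss using a weight λ. The function name `combined_loss` and the numeric values are hypothetical and chosen only for illustration; in a real system the two inputs would come from the CTC and attention decoders.

```python
def combined_loss(ctc_loss: float, aed_loss: float, lam: float) -> float:
    """Interpolate the CTC loss and the attention-decoder (AED) loss.

    lam is the CTC weight: lam = 1 trains with CTC only, lam = 0 with AED only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * aed_loss


# Illustrative values only; they are not taken from the paper.
example = combined_loss(ctc_loss=2.0, aed_loss=1.0, lam=0.3)
```

Because both decoders share the encoder, a single backward pass through this scalar would update the encoder with gradients from both objectives.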

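Proposed method

Adopts a hybrid CTC/attention architecture with a shared encoder and separate CTC and attention decoders.
Uses dynamic chunk-based attention to allow an arbitrary right-context length, so that inference latency is controlled by the chunk size.
Trains with a combined CTC and AED loss to simplify optimization (L_combined = λ L_CTC + (1−λ) L_AED).
Uses causal convolution in the Conformer encoder so that the right context, and therefore the latency, does not grow with network depth.
At decoding time, generates n-best CTC hypotheses in a streaming fashion and rescores them with the attention decoder to obtain the final result.
Provides a two-pass decoding scheme (streaming CTC decoding in the first pass, followed by attention-based rescoring) to balance latency and accuracy.
Studies static versus dynamic chunk training and dynamic chunk scheduling strategies for unifying the streaming and non-streaming modes.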
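The chunk-based attention described above can be sketched as a boolean mask in which each frame attends to all earlier frames plus the rest of its own chunk. This is a minimal illustration, not the authors' implementation. The names `chunk_attention_mask`, `sample_chunk_size`, and `chunk_latency_ms` are hypothetical, the sampling schedule (half full attention, half a small random chunk) is only one plausible choice, and the latency helper assumes 4x frame subsampling and a 10 ms frame shift, under which a chunk of 16 corresponds to 640 ms.

```python
import random
from typing import List


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Return mask[i][j], True when frame i may attend to frame j.

    Frame i sees the full left context and every frame up to the end of
    the chunk that contains it. A chunk_size >= num_frames gives full attention.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames: int, rng: random.Random, max_chunk: int = 25) -> int:
    """Assumed dynamic-chunk schedule: pick a chunk size per training batch.

    Half of the time use the whole utterance (non-streaming behaviour),
    otherwise draw a small chunk uniformly from [1, max_chunk].
    """
    if rng.random() < 0.5:
        return num_frames
    return rng.randint(1, max_chunk)


def chunk_latency_ms(chunk_size: int, subsampling: int = 4, frame_shift_ms: int = 10) -> int:
    """Approximate chunk latency under the assumed subsampling and frame shift."""
    return chunk_size * subsampling * frame_shift_ms


mask = chunk_attention_mask(num_frames=5, chunk_size=2)
```

Training with a different sampled chunk size per batch exposes one set of weights to both limited and full context, which is what lets the same model be decoded with any chunk size at inference time.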

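Experimental results

Research questions

RQ1: Can a single model support both streaming and non-streaming ASR while keeping competitive accuracy?
RQ2: How does dynamic chunk-based attention affect the latency-accuracy trade-off at inference time?
RQ3: Does attention rescoring of CTC-generated hypotheses outperform standalone autoregressive attention decoding in real-time performance?
RQ4: Which training strategies (static chunk, dynamic chunk, chunk-size distribution) best unify streaming and non-streaming behavior?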

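Key findings

On AISHELL-1, the unified model achieves a 5.60% relative CER reduction in non-streaming ASR compared with a standard non-streaming Transformer.
In streaming mode, the same model reaches a CER of 5.42% at 640 ms latency.
Attention rescoring of the CTC hypotheses, with a CTC weight of 0.5 in the rescoring stage, improves CER to 4.72%, better than either CTC prefix beam search or autoregressive attention decoding alone.
Attention rescoring decodes faster than autoregressive attention decoding, with a speedup of about 2.40x in decoding time in the reported setting.
Dynamic chunk training performs comparably to static chunk training and matches or exceeds the static configurations at moderate chunk sizes (e.g. 16/8/4), while allowing a latency-accuracy trade-off.
The method achieves state-of-the-art streaming accuracy on AISHELL-1 and scales to a large Mandarin dataset (15,000-hour experiments) with competitive results.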
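The second-pass rescoring behind these results can be sketched as follows. Each first-pass hypothesis is scored once by the attention decoder and that score is combined with its CTC score using a CTC weight. The combination rule shown here (attention score plus weighted CTC score), the function name `attention_rescore`, and the toy scores are assumptions for illustration, not the exact formulation from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A first-pass hypothesis: (token sequence, CTC log-probability).
Hypothesis = Tuple[Sequence[str], float]


def attention_rescore(
    nbest: List[Hypothesis],
    attention_score: Callable[[Sequence[str]], float],
    ctc_weight: float = 0.5,
) -> Tuple[Sequence[str], float]:
    """Pick the hypothesis with the best combined score.

    attention_score stands in for one teacher-forced pass of the attention
    decoder over a full hypothesis, returning its log-probability.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")
    best_tokens, best_score = nbest[0][0], float("-inf")
    for tokens, ctc_logp in nbest:
        score = attention_score(tokens) + ctc_weight * ctc_logp
        if score > best_score:
            best_tokens, best_score = tokens, score
    return best_tokens, best_score


# Toy scores, invented for illustration only.
toy_attention = {("ni", "hao"): -0.5, ("ni", "hao4"): -2.0}
nbest = [(("ni", "hao"), -1.0), (("ni", "hao4"), -0.8)]
best, score = attention_rescore(nbest, lambda t: toy_attention[tuple(t)])
```

In this toy case the hypothesis ranked second by CTC alone wins after rescoring. Since every hypothesis is already complete, the attention decoder can score all of them in parallel rather than token by token, which is consistent with the speedup over autoregressive decoding reported above.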


This review was generated by AI and reviewed by a human editor.