QUICK REVIEW

[Paper Review] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

Youness Dkhissi, Valentin Vielzeuf | arXiv (Cornell University) | Feb 17, 2026

Speech Recognition and Synthesis | Cited by 0

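One-sentence summary

SENS-ASR injects semantic context into the frame representations of a streaming ASR system through a context module distilled from a sentence-embedding teacher, improving WER at small streaming chunk sizes without any external rescoring.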

ABSTRACT

Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially while working with low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.

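Motivation and objectives

Improve streaming ASR when future context is limited, so that transcription quality does not degrade.
Propose a context module that injects semantic information into the encoder frame embeddings.
Use knowledge distillation from a sentence embedding model to guide the semantic context.
Show, across datasets, that semantic enrichment improves WER in small-chunk streaming scenarios.
Maintain competitive performance on full-context audio through Dynamic Chunk Training.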

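Proposed method

Extend the RNN-T architecture with a dedicated Context Module.
Train the Context Module by knowledge distillation from a Sentence Embedding Model fine-tuned on the target domain.
Use attention pooling to produce a semantic context embedding from the past frame embeddings.
Concatenate the chunk-level semantic context with each frame embedding before the joint network.
Optimize with L_RNN-T plus the distillation loss L_MSE, with alpha tuned to 0.2 and FastEmit regularization applied.
Use Dynamic Chunk Training to expose the model to varied context lengths during training.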
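
The steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the hypothetical learned vector `query` that drives the attention pooling, the choice to pool over every frame seen up to the end of the current chunk, a context vector with the same dimension as the frames (a real system would likely project to the teacher's embedding size), and the additive form `L_RNN-T + alpha * L_MSE` for the combined loss. The RNN-T loss itself is passed in as a plain number, and FastEmit and Dynamic Chunk Training are left out.

```python
import math
import random


def attention_pool(frames, query):
    """Pool frame embeddings into one vector using softmax attention weights.

    `query` is a hypothetical learned vector (an assumption of this sketch).
    """
    dim = len(query)
    scores = [sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
              for frame in frames]
    top = max(scores)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]


def inject_context(frames, chunk_size, query):
    """Append a chunk-level context vector to every frame of each chunk.

    Assumption: the context of a chunk is pooled from all frames received so
    far (up to the end of that chunk); no future frames are ever used.
    """
    enriched, contexts = [], []
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        context = attention_pool(frames[:end], query)
        contexts.append(context)
        for frame in frames[start:end]:
            enriched.append(frame + context)  # list concatenation
    return enriched, contexts


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(rnnt_loss, context, teacher_embedding, alpha=0.2):
    """Assumed combination: transducer loss plus alpha times the distillation MSE."""
    return rnnt_loss + alpha * mse(context, teacher_embedding)


# Usage: 6 random frames of dimension 4, processed in chunks of 2 frames.
random.seed(0)
frames = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
query = [0.5, -0.5, 0.25, 0.0]
enriched, contexts = inject_context(frames, chunk_size=2, query=query)
print(len(enriched), len(enriched[0]), len(contexts))  # 6 8 3
```

Each enriched frame is twice as wide as the input because the context vector is concatenated rather than added, which is what the method description states happens before the joint network. The teacher embedding passed to `total_loss` would come from the fine-tuned sentence embedding model applied to the reference transcription.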

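Experimental results

Research questions

RQ1: Under small-chunk streaming constraints, does injecting semantic context into the frame embeddings of a streaming ASR system reduce WER?
RQ2: How does SENS-ASR compare with the baseline RNN-T on LibriSpeech and TEDLIUM-2 across different chunk sizes?
RQ3: Can a distillation-guided semantic teacher improve the encoder representations without requiring external rescoring at inference?
RQ4: How does fine-tuning the sentence embedding teacher through paraphrase-based domain adaptation affect downstream ASR?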
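RQ5: How robust is the proposed method under streaming conditions across speaking styles, from the read audiobook speech of LibriSpeech (test-clean/test-other) to the TED talks of TEDLIUM-2?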

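Main findings

SENS-ASR lowers WER relative to the baseline on multiple datasets at small chunk sizes (e.g., 160 ms, 320 ms).
On LibriSpeech test-clean with 160 ms chunks, WER drops from 7.55 (baseline) to 7.21 (SENS-ASR).
On LibriSpeech test-clean with 1280 ms chunks, WER drops from 3.49 (baseline) to 3.44 (SENS-ASR); full-context results are similar.
On LibriSpeech test-other with 160 ms chunks, WER drops from 18.34 (baseline) to 17.89 (SENS-ASR).
On TEDLIUM-2 with 160 ms chunks, WER drops from 16.52 (baseline) to 15.60 (SENS-ASR).
On LibriSpeech test-clean, SENS-ASR, trained with Dynamic Chunk Training, is competitive with recent streaming ASR models; at larger chunk sizes or with full context it performs close to the baseline.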
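
To put the reported numbers on a common scale, the short script below computes the absolute and relative WER reduction for each pair listed above. The WER values are copied from the findings. The helper name `relative_reduction` and the dictionary layout are illustrative choices, not something taken from the paper.

```python
# (baseline WER, SENS-ASR WER) pairs copied from the findings above.
results = {
    "LibriSpeech test-clean, 160 ms": (7.55, 7.21),
    "LibriSpeech test-clean, 1280 ms": (3.49, 3.44),
    "LibriSpeech test-other, 160 ms": (18.34, 17.89),
    "TEDLIUM-2, 160 ms": (16.52, 15.60),
}


def relative_reduction(baseline, system):
    """Relative WER reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline


for name, (baseline, system) in results.items():
    absolute = baseline - system
    relative = relative_reduction(baseline, system)
    print(f"{name}: {absolute:.2f} absolute, {relative:.1f}% relative")
```

The relative reductions come out at roughly 4.5% and 2.5% on LibriSpeech test-clean and test-other with 160 ms chunks, about 5.6% on TEDLIUM-2, and about 1.4% with 1280 ms chunks, which matches the summary's point that the gain is concentrated at small chunk sizes.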

This review was generated by AI and checked by a human editor.