QUICK REVIEW

[Paper Review] Zipformer: A faster and better encoder for automatic speech recognition

Zengwei Yao, Liyong Guo|arXiv (Cornell University)|Oct 17, 2023

Speech Recognition and Synthesis|Cited by 28

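One-Sentence Summary

Zipformer is a faster, more memory-efficient ASR encoder that combines a U-Net-like downsampling structure, BiasNorm, and the Swoosh activation functions, and is trained with the ScaledAdam optimizer, achieving state-of-the-art results on LibriSpeech, Aishell-1, and WenetSpeech.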

ABSTRACT

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.

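Motivation and Objectives

Address the need for faster, more memory-efficient encoders in end-to-end ASR systems.
Propose the Zipformer architecture, which uses temporal downsampling and reuses attention weights to improve efficiency.
Introduce BiasNorm, the SwooshR/SwooshL activations, and the ScaledAdam optimizer to improve training and inference.
Evaluate Zipformer on LibriSpeech, Aishell-1, and WenetSpeech, with ablation studies to understand the contribution of each component.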

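Proposed Method

Propose a U-Net-like encoder whose stacks downsample the input so that the middle stacks operate at lower frame rates.
Adopt a redesigned Zipformer block with an expanded set of modules, including Non-Linear Attention (NLA) and Bypass connections, within which attention weights are reused for efficiency.
Replace LayerNorm with BiasNorm so that some length information is retained through normalization.
Introduce two activation functions, SwooshR and SwooshL, suited to the needs of different modules.
Develop ScaledAdam, a scale-aware optimizer that scales each update by the parameter tensor's current scale (its RMS) and explicitly learns the parameter scale, yielding faster convergence.
Provide extensive experiments and ablation studies on LibriSpeech, Aishell-1, and WenetSpeech, with comparisons against state-of-the-art models.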
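
The normalization and activation items above can be made concrete with a small sketch. It is written in plain Python for illustration and is not the authors' implementation. The formulas and constants are assumptions recalled from the Zipformer paper rather than stated in this review: SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035, and BiasNorm(x) = x / RMS(x - b) * exp(γ) with a learnable per-channel bias b and a learnable scalar γ. The function names and the `floor` guard are hypothetical.

```python
import math


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def swoosh_r(x: float) -> float:
    """SwooshR as assumed above; the offset makes swoosh_r(0) approximately 0."""
    return _softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    """SwooshL as assumed above; a right-shifted variant of the same shape."""
    return _softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale, floor=1e-8):
    """BiasNorm over one frame's channel vector.

    x and bias are equal-length lists of floats; log_scale is the scalar gamma.
    The floor only guards against division by zero (an assumption of this sketch).
    """
    mean_sq = sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x)
    rms = max(math.sqrt(mean_sq), floor)
    gain = math.exp(log_scale)
    return [xi / rms * gain for xi in x]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [0.0, 0.0], 0.0))
    print(bias_norm([6.0, 8.0], [0.0, 0.0], 0.0))
    print(bias_norm([3.0, 4.0], [1.0, 1.0], 0.0))
    print(bias_norm([6.0, 8.0], [1.0, 1.0], 0.0))
```

With a zero bias, `bias_norm` returns the same output for `[3.0, 4.0]` and `[6.0, 8.0]`, so the length of the input is lost. With a nonzero bias the two outputs differ, which is one way to read the claim that BiasNorm retains some length information.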

Figure 1: Overall architecture of Zipformer.
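
The frame-rate idea behind the architecture can be illustrated with a few lines of arithmetic. The sketch below assumes a 50 Hz input to the encoder stacks and per-stack downsampling factors of 1, 2, 4, 8, 4, 2. Both are assumptions recalled from the paper, not figures stated in this review, and the function names are hypothetical. It counts the frames each stack would process and the relative cost of computing pairwise attention scores, which grows with the square of the sequence length.

```python
import math

BASE_RATE_HZ = 50.0                      # assumed frame rate entering the stacks
DOWNSAMPLE_FACTORS = (1, 2, 4, 8, 4, 2)  # assumed per-stack downsampling factors


def stack_frame_counts(duration_s, base_rate_hz=BASE_RATE_HZ,
                       factors=DOWNSAMPLE_FACTORS):
    """Number of frames each stack processes for an utterance of duration_s seconds."""
    base_frames = int(duration_s * base_rate_hz)
    return [math.ceil(base_frames / f) for f in factors]


def relative_attention_cost(factors=DOWNSAMPLE_FACTORS):
    """Attention-score cost per stack relative to running at the base rate.

    Pairwise scores scale with frames squared, so a factor f gives 1 / f**2.
    """
    return [1.0 / (f * f) for f in factors]


if __name__ == "__main__":
    print(stack_frame_counts(10.0))
    print(relative_attention_cost())
```

Under these assumptions, a 10-second utterance gives 500 frames at the first stack but only 63 at the most heavily downsampled stack, where the attention-score cost is 1/64 of the full-rate cost.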

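Experimental Results

Research Questions

RQ1: How can the encoder of an end-to-end ASR system be made faster and more memory-efficient without sacrificing accuracy?
RQ2: Do temporal downsampling and attention-weight sharing in the Zipformer block improve efficiency and performance?
RQ3: Do the normalization and activation choices (BiasNorm, SwooshR, SwooshL) improve training stability and accuracy?
RQ4: Does the scale-aware optimizer (ScaledAdam) outperform Adam when training Zipformer models?
RQ5: How does Zipformer compare with state-of-the-art models on LibriSpeech, Aishell-1, and WenetSpeech?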

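Key Findings

Zipformer-S/M/L perform on par with state-of-the-art results on LibriSpeech, Aishell-1, and WenetSpeech while reducing FLOPs and parameter counts.
On LibriSpeech, Zipformer-L and Zipformer-L* reach a WER close to that of Conformer-L with roughly half the FLOPs and memory usage.
Zipformer converges faster in training and delivers a speedup of more than 50% on GPU, without excessive memory usage.
Ablation studies show that downsampling, shared attention weights, BiasNorm, the Swoosh activations, and ScaledAdam each contribute positively to performance and efficiency.
ScaledAdam outperforms Adam on LibriSpeech in both convergence and final WER, with notable gains on test-clean and test-other.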
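
The optimizer finding is easier to interpret with the update rule in view. The sketch below shows only the update-scaling idea from the abstract (an Adam-style step multiplied by the parameter tensor's RMS) in plain Python. It is a simplification under stated assumptions, not the ScaledAdam implementation from icefall: the scale-learning step is omitted, and the hyperparameters, the `floor` guard, and all function names are hypothetical.

```python
import math


def rms(values, floor=1e-5):
    """Root-mean-square of a list of floats, floored so tiny tensors still move."""
    return max(math.sqrt(sum(v * v for v in values) / len(values)), floor)


def scaled_adam_like_step(param, grad, m, v, step,
                          lr=0.05, beta1=0.9, beta2=0.98, eps=1e-8):
    """One Adam-style step whose update is multiplied by rms(param).

    param, grad, m, v are equal-length lists; step counts from 1.
    Returns (new_param, new_m, new_v).
    """
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1 - beta2 ** step) for vi in v]   # bias-corrected second moment
    scale = rms(param)                               # the tensor's current scale
    new_param = [p - lr * scale * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(param, m_hat, v_hat)]
    return new_param, m, v


def relative_change(old, new):
    """RMS of the update divided by RMS of the original tensor."""
    delta = [n - o for o, n in zip(old, new)]
    return rms(delta) / rms(old)


if __name__ == "__main__":
    grad = [0.5, -0.5]
    for param in ([1.0, -2.0], [100.0, -200.0]):
        new_param, _, _ = scaled_adam_like_step(param, grad, [0.0, 0.0], [0.0, 0.0], step=1)
        print(param, relative_change(param, new_param))
```

Because the step is proportional to the tensor's RMS, a tensor 100 times larger moves 100 times further for the same gradient, so the relative change per step stays about the same.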

This review was generated by AI and reviewed by a human editor.