QUICK REVIEW

[Paper Review] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Sehoon Kim, Amir Gholami|arXiv (Cornell University)|Jun 2, 2022

Speech Recognition and Synthesis | Cited by 75

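One-Sentence Summary

Squeezeformer redesigns Conformer with a Temporal U-Net macro-architecture and a simplified micro-architecture, reaching state-of-the-art WER on LibriSpeech test-other without an external language model and at comparable FLOPs.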

ABSTRACT

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.

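Motivation and Objectives

Improve on Conformer in both the efficiency and the accuracy of end-to-end ASR.
Systematically study macro- and micro-architecture choices to reduce computation and improve performance.
Propose a simpler, more efficient hybrid attention-convolution backbone for ASR.
Show that performance scales across model sizes and FLOPs budgets without an external language model.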

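Proposed Method

Introduce a Temporal U-Net inside the encoder that down-samples the representation and later up-samples it again.
Adopt a Transformer-style MF/CF block structure, in which an MHA or convolution module is followed by a feed-forward module, in place of Conformer's Macaron design.
Unify the activation functions by replacing the GLU in the convolution module with Swish.
Replace the redundant pre-Layer Normalization with a learnable scaling layer plus post-LN, so that the scaling can be fused at zero cost at inference time.
Replace the initial down-sampling convolution with a depthwise separable down-sampling layer to lower FLOPs.
Train and compare several model sizes (XS, S, SM, M, ML, L) under the same training setup, without an external language model.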
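
The Temporal U-Net and the scaling layer described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: mean pooling stands in for the learned down-sampling layer, frame repetition stands in for up-sampling, and the names `downsample_by_two`, `upsample_by_two`, `scale_layer`, and `temporal_unet` are hypothetical. The `early`, `middle`, and `late` arguments are placeholders for stacks of encoder blocks.

```python
from typing import Callable, List

Frame = List[float]        # one time step: a vector of channel values
Sequence = List[Frame]     # time-major sequence of frames
Block = Callable[[Sequence], Sequence]


def downsample_by_two(x: Sequence) -> Sequence:
    """Halve the time axis by averaging adjacent frame pairs.

    Assumption: mean pooling stands in for a learned strided layer.
    An odd trailing frame is passed through unchanged.
    """
    out = []
    for t in range(0, len(x), 2):
        pair = x[t:t + 2]
        out.append([sum(vals) / len(pair) for vals in zip(*pair)])
    return out


def upsample_by_two(x: Sequence, target_len: int) -> Sequence:
    """Restore the time axis by repeating each frame twice, then trimming."""
    repeated = [list(frame) for frame in x for _ in range(2)]
    return repeated[:target_len]


def scale_layer(x: Sequence, gamma: Frame, beta: Frame) -> Sequence:
    """Per-channel learnable scale and bias: y = gamma * x + beta."""
    return [[g * v + b for v, g, b in zip(frame, gamma, beta)] for frame in x]


def temporal_unet(x: Sequence, early: Block, middle: Block, late: Block) -> Sequence:
    """Run blocks at full rate, half rate, then full rate with a skip connection."""
    h = early(x)
    skip = h                                  # saved for the skip connection
    h = downsample_by_two(h)                  # middle blocks see half as many frames
    h = middle(h)
    h = upsample_by_two(h, len(skip))
    h = [[a + b for a, b in zip(f, s)] for f, s in zip(h, skip)]
    return late(h)
```

Because `scale_layer` is a per-channel affine map with no dependence on input statistics, it can in principle be folded into the weights of a following linear layer, which is the intuition behind the zero-cost fusion mentioned in the list.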

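Experimental Results

Research Questions

RQ1: In ASR, can Conformer-style design choices be simplified without sacrificing accuracy?
RQ2: Which macro-architecture changes (such as temporal down-sampling) reduce attention cost and improve stability?
RQ3: Which micro-architecture refinements (activations, normalization, down-sampling) yield better WER and efficiency?
RQ4: At similar FLOPs, do the Squeezeformer variants consistently outperform Conformer and other baselines?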

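Key Findings

Squeezeformer reaches 7.5%, 6.5%, and 6.0% WER on LibriSpeech test-other without external LMs, which is 3.1%, 1.4%, and 0.6% better than Conformer-CTC at the same FLOPs.
Temporal U-Net down-sampling lowers attention cost and improves stability, cutting attention FLOPs by up to 2.31–2.53× while also improving WER.
The unified Swish activation and the scaled post-LN improve training stability, improving test-other WER by roughly 0.2–0.7% across variants.
Depthwise separable down-sampling cuts FLOPs substantially (about 28% in the down-sampling layer) and raises throughput to about 1.34×, with no loss in WER.
Squeezeformer-SM and Squeezeformer-M outperform Conformer baselines at comparable FLOPs and model size, reaching state-of-the-art results in several settings.
Ablation studies confirm that the Temporal U-Net skip connection, the learnable scaling layer, and the Swish activation are all needed for the best performance.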
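
The direction of the savings reported above can be checked with back-of-envelope arithmetic. The sketch below is a rough cost model, not the paper's FLOPs accounting: it assumes 2 FLOPs per multiply-add, ignores softmax, biases, and multi-head bookkeeping, and uses illustrative values (`seq_len=1000`, `d_model=256`, `kernel=3`) that do not come from the paper. The function names are hypothetical.

```python
def attention_flops(seq_len: int, d_model: int) -> dict:
    """Rough FLOPs for one self-attention layer, split by scaling behaviour."""
    # Q, K, V and output projections: linear in sequence length.
    projections = 4 * 2 * seq_len * d_model * d_model
    # Q @ K^T and attention-weights @ V: quadratic in sequence length.
    scores = 2 * seq_len * seq_len * d_model
    weighted_sum = 2 * seq_len * seq_len * d_model
    return {"linear": projections, "quadratic": scores + weighted_sum}


def conv_macs_per_output(kernel: int, c_in: int, c_out: int, separable: bool) -> int:
    """Multiply-adds per output position of a 2-D convolution."""
    if separable:
        # Depthwise (one kernel per channel) followed by a 1x1 pointwise conv.
        return kernel * kernel * c_in + c_in * c_out
    return kernel * kernel * c_in * c_out


full = attention_flops(seq_len=1000, d_model=256)
half = attention_flops(seq_len=500, d_model=256)   # after 2x temporal down-sampling

linear_ratio = full["linear"] / half["linear"]            # projection savings
quadratic_ratio = full["quadratic"] / half["quadratic"]   # score-matrix savings
total_ratio = sum(full.values()) / sum(half.values())     # falls between the two

standard = conv_macs_per_output(3, 256, 256, separable=False)
separable = conv_macs_per_output(3, 256, 256, separable=True)
print(linear_ratio, quadratic_ratio, round(total_ratio, 2), standard, separable)
```

Halving the sequence length halves the projection cost and quarters the score-matrix cost, so the per-layer saving lands somewhere between 2× and 4× depending on the ratio of sequence length to model width. This per-layer arithmetic is not meant to reproduce the reported figures, which depend on the actual model dimensions and on which layers operate at the reduced rate.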

This review was generated by AI and checked by a human editor.