QUICK REVIEW

[Paper Review] SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization

Jialong Guo, Xinghao Chen|arXiv (Cornell University)|May 19, 2024

Advanced Memory and Neural Computing|Cited by 7

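ONE-SENTENCE SUMMARY

This paper proposes SLAB, which builds more efficient transformers by progressively replacing LayerNorm with a re-parameterizable BatchNorm (RepBN) and by adopting simplified linear attention (SLA), lowering latency while preserving accuracy on vision and language tasks.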

ABSTRACT

Transformers have become foundational architectures for both natural language and computer vision tasks. However, the high computational cost makes it quite challenging to deploy on resource-constraint devices. This paper investigates the computational bottleneck modules of efficient transformer, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computational friendly due to statistic calculation during inference. However, replacing LayerNorm with more efficient BatchNorm in transformer often leads to inferior performance and collapse in training. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm in training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective to achieve strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluated our method for language modeling task and obtain comparable performance and lower latency. Codes are publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.

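MOTIVATION AND OBJECTIVES

Identify the computational bottlenecks in the normalization and attention modules of Transformers.
Propose Progressive Re-parameterized BatchNorm (PRepBN), so that LayerNorm is replaced by a stable BN at inference time.
Develop simplified linear attention (SLA), which lowers computational cost while preserving performance.
Demonstrate the efficiency of SLAB on image classification, object detection, and language modeling.
Show that PRepBN and SLA achieve lower latency at accuracy comparable to or higher than LN-based Transformers and earlier linear-attention Transformers.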

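PROPOSED METHOD

Progressively replace LayerNorm with RepBN during training, using a coefficient gamma that decays with the training step to move the Transformer from an LN-dominated structure to a BN-based one.
Introduce RepBN(X) = BN(X) + eta * X, where eta is a learnable parameter, so that the layer can be re-parameterized into a standard BN after training.
Give a re-parameterization lemma that converts RepBN into the standard BN form used at inference.
Define the progressive LN -> RepBN transition as PRepBN(X) = gamma * LN(X) + (1 - gamma) * RepBN(X), where gamma decays over the training steps.
Propose SLA, where Sim_SLA(Qi, Kj) = ReLU(Qi) ReLU(Kj)^T, followed by normalized aggregation and local enhancement through a depthwise convolution.
Show that SLA runs in linear time with hardware-friendly operations, and that its attention diversity is consistent with the visualized attention maps.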
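The normalization steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on plain Python floats for a single token, the LayerNorm omits its affine parameters, and the linear decay in `linear_gamma` is a hypothetical schedule (the summary only says that gamma decays with the training step). The folding rule in `fold_rep_bn` follows directly from RepBN(X) = BN(X) + eta * X when BN uses fixed running statistics.

```python
import math

def batch_norm(x, mean, var, weight, bias, eps=1e-5):
    """Inference-time BN for one channel, using fixed running statistics."""
    return (x - mean) / math.sqrt(var + eps) * weight + bias

def rep_bn(x, mean, var, weight, bias, eta, eps=1e-5):
    """RepBN(x) = BN(x) + eta * x."""
    return batch_norm(x, mean, var, weight, bias, eps) + eta * x

def fold_rep_bn(mean, var, weight, bias, eta, eps=1e-5):
    """Fold the eta * x branch into the BN affine parameters.

    BN(x) + eta * x equals a plain BN with the same statistics and
    weight' = weight + eta * std, bias' = bias + eta * mean.
    """
    std = math.sqrt(var + eps)
    return weight + eta * std, bias + eta * mean

def layer_norm(xs, eps=1e-5):
    """LayerNorm over the channels of one token (affine part omitted)."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / math.sqrt(v + eps) for x in xs]

def linear_gamma(step, total_steps):
    """Assumed schedule: gamma goes linearly from 1 to 0, then stays at 0."""
    return max(0.0, 1.0 - step / total_steps)

def prep_bn(xs, gamma, bn_params, eta):
    """PRepBN(x) = gamma * LN(x) + (1 - gamma) * RepBN(x) for one token.

    bn_params holds one (mean, var, weight, bias) tuple per channel.
    """
    ln_out = layer_norm(xs)
    rep_out = [rep_bn(x, *p, eta) for x, p in zip(xs, bn_params)]
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(ln_out, rep_out)]

# Usage: after folding, a plain BN reproduces RepBN on the same input.
w2, b2 = fold_rep_bn(mean=0.5, var=4.0, weight=1.2, bias=-0.3, eta=0.7)
print(rep_bn(3.5, 0.5, 4.0, 1.2, -0.3, 0.7), batch_norm(3.5, 0.5, 4.0, w2, b2))
```

Once gamma reaches 0 the LN branch contributes nothing, so only the folded BN remains at inference and no per-token statistics need to be computed.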
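To make the linear-time claim for SLA concrete, the sketch below computes the same ReLU-kernel attention in two ways: with an explicit N x N similarity matrix, and with the key-value summary computed once and reused for every query. It is an illustration under assumptions rather than the paper's code: it handles a single head on nested Python lists, the `eps` term guarding against a zero denominator is added here, and the depthwise-convolution local enhancement is left out.

```python
def relu_vec(v):
    """Element-wise ReLU on one feature vector."""
    return [max(0.0, x) for x in v]

def sla_quadratic(Q, K, V, eps=1e-6):
    """Reference form: build every Sim(Qi, Kj), then normalize each row."""
    Qp = [relu_vec(q) for q in Q]
    Kp = [relu_vec(k) for k in K]
    dv = len(V[0])
    out = []
    for q in Qp:
        sims = [sum(a * b for a, b in zip(q, k)) for k in Kp]
        denom = sum(sims) + eps
        out.append([sum(s * v[c] for s, v in zip(sims, V)) / denom
                    for c in range(dv)])
    return out

def sla_linear(Q, K, V, eps=1e-6):
    """Linear form: accumulate ReLU(K)^T V and sum_j ReLU(Kj) once."""
    Qp = [relu_vec(q) for q in Q]
    Kp = [relu_vec(k) for k in K]
    d, dv = len(Kp[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]   # d x dv summary of keys and values
    ksum = [0.0] * d                      # sum of ReLU(Kj) over all tokens
    for k, v in zip(Kp, V):
        for i in range(d):
            ksum[i] += k[i]
            for c in range(dv):
                kv[i][c] += k[i] * v[c]
    out = []
    for q in Qp:
        denom = sum(q[i] * ksum[i] for i in range(d)) + eps
        out.append([sum(q[i] * kv[i][c] for i in range(d)) / denom
                    for c in range(dv)])
    return out

# Usage: both forms agree up to floating-point rounding.
Q = [[1.0, -2.0], [0.5, 0.3], [-1.0, 2.0]]
K = [[0.2, 1.0], [-0.5, 0.7], [1.5, -0.1]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.5]]
print(sla_quadratic(Q, K, V))
print(sla_linear(Q, K, V))
```

The linear form costs O(N * d * dv) instead of O(N^2 * d), which is where the saving comes from when the token count N is much larger than the head dimension.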

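EXPERIMENTAL RESULTS

Research Questions

RQ1: Can replacing LayerNorm with progressively trained RepBN reduce inference latency without sacrificing accuracy?
RQ2: Does the proposed SLA match or exceed existing linear attention methods while offering lower computational cost?
RQ3: How do PRepBN and SLA interact across different Transformer backbones (DeiT, PVT, Swin) in vision tasks and language modeling?
RQ4: What is the concrete accuracy-latency trade-off when PRepBN and SLA are combined on standard benchmarks?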

Key Findings

Method	FLOPs	Latency (ms)	Top-1 accuracy
Flatten-DeiT-T	1.1 G	15.2	74.1%
SLAB-DeiT-T	1.1 G	9.6	74.3%
Flatten-DeiT-S	4.4 G	15.5	80.4%
SLAB-DeiT-S	4.4 G	10.4	80.0%
Flatten-PVT-T	2.0 G	10.8	77.8%
SLAB-PVT-T	2.0 G	8.0	76.5%
Flatten-CSwin-T	4.3 G	32.4	83.1%
SLAB-CSwin-T	4.3 G	29.3	82.8%
Flatten-Swin-T	4.5 G	10.9	82.1%
SLAB-Swin-T	4.5 G	8.7	81.8%
Flatten-Swin-S	8.8 G	18.6	83.5%
SLAB-Swin-S	8.7 G	16.2	83.6%
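The rows above pair each Flatten model with its SLAB counterpart, so the trade-off can be read off with simple arithmetic. The short script below does that; the numbers are copied from the table, and the helper name `compare` is introduced here purely for illustration.

```python
# (latency in ms, top-1 accuracy in %) for the Flatten and SLAB variant
# of each backbone, copied from the table above.
RESULTS = {
    "DeiT-T":  ((15.2, 74.1), (9.6, 74.3)),
    "DeiT-S":  ((15.5, 80.4), (10.4, 80.0)),
    "PVT-T":   ((10.8, 77.8), (8.0, 76.5)),
    "CSwin-T": ((32.4, 83.1), (29.3, 82.8)),
    "Swin-T":  ((10.9, 82.1), (8.7, 81.8)),
    "Swin-S":  ((18.6, 83.5), (16.2, 83.6)),
}

def compare(flatten, slab):
    """Return (ms saved, percent latency saved, top-1 change in points)."""
    (f_lat, f_acc), (s_lat, s_acc) = flatten, slab
    saved = f_lat - s_lat
    return round(saved, 1), round(100 * saved / f_lat, 1), round(s_acc - f_acc, 1)

for name, (flatten, slab) in RESULTS.items():
    saved, pct, d_acc = compare(flatten, slab)
    print(f"{name:8s} latency -{saved} ms ({pct}%), top-1 {d_acc:+} pt")
```

For Swin-S this gives the 2.4 ms saving and 0.1-point gain quoted in the abstract.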

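PRepBN with progressive LN conversion improves the accuracy of BN-based Transformers while lowering their inference latency.
SLAB-Swin-S reaches 83.6% top-1 accuracy on ImageNet-1K at 16.2 ms latency, which is 2.4 ms lower than Flatten-Swin-S (18.6 ms) with 0.1% higher accuracy.
SLA reduces latency on every backbone listed while keeping accuracy comparable to the Flatten Transformer counterparts (within 1.3 points in the table above).
Across multiple backbones, SLAB variants show higher throughput or reach comparable accuracy at lower latency (for example, SLAB-DeiT-T: 74.3% top-1 at 9.6 ms; Flatten-DeiT-T: 74.1% at 15.2 ms).
In the language modeling and LLaMA-350M experiments, PRepBN achieves lower inference latency and higher throughput while maintaining similar perplexity.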


This review was generated by AI and checked by a human editor.