QUICK REVIEW

[Paper Review] Separable Self-attention for Mobile Vision Transformers

Sachin Mehta, Mohammad Rastegari|arXiv (Cornell University)|Jun 6, 2022

Advanced Neural Network Applications | Cited by 182

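One-sentence summary

This paper introduces separable self-attention with linear O(k) complexity as a replacement for the standard multi-headed self-attention in MobileViT, giving faster inference on mobile devices at similar accuracy.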

ABSTRACT

Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: \url{https://github.com/apple/ml-cvnets}

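Motivation and objectives

Identify and address the latency bottleneck that multi-headed self-attention (MHA) introduces in vision transformers aimed at mobile devices.
Propose a separable self-attention mechanism with linear complexity that relies on element-wise operations.
Integrate separable self-attention into MobileViT to form MobileViTv2.
Demonstrate faster inference with equal or better accuracy on ImageNet-1k, MS-COCO, and segmentation benchmarks.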

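Proposed method

Replace the multi-headed self-attention in MobileViT with separable self-attention, which computes context scores with respect to a latent token L and so reduces complexity from O(k^2) to O(k).
Compute the context scores with a single latent projection (branch I, weights W_I) followed by a softmax, giving c_s; then take the sum of the tokens projected by W_K (branch K), weighted by c_s, to obtain the context vector c_v.
Propagate the contextual information to branch V by broadcast element-wise multiplication of c_v with ReLU(xW_V), then apply a final linear projection W_O to produce the output y.
Express separable self-attention as y = ( sum( sigma(xW_I) * (xW_K) ) * ReLU(xW_V) ) W_O, where sigma is the softmax, * is broadcastable element-wise multiplication, and sum runs over the k tokens; the form emphasizes the element-wise operations.
Integrate separable self-attention into MobileViTv2 by replacing MHA in MobileViT, and explore model width with a multiplier alpha in {0.5, 2.0}.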
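The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is in the ml-cvnets repository linked in the abstract): it handles a single sequence of k tokens with no batch dimension, omits bias terms, uses random weights, and the function names `softmax` and `separable_self_attention` are hypothetical.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def separable_self_attention(x, W_I, W_K, W_V, W_O):
    """Sketch of y = ( sum( sigma(xW_I) * (xW_K) ) * ReLU(xW_V) ) W_O.

    Assumed shapes: x is (k, d); W_I is (d, 1); W_K, W_V, W_O are (d, d).
    Bias terms are omitted for brevity (an assumption of this sketch).
    """
    # Branch I: one scalar score per token, normalized over the k tokens.
    c_s = softmax(x @ W_I, axis=0)                       # (k, 1)
    # Branch K: score-weighted sum of projected tokens -> context vector.
    c_v = (c_s * (x @ W_K)).sum(axis=0, keepdims=True)   # (1, d)
    # Branch V: broadcast the context vector over every token.
    out = c_v * np.maximum(x @ W_V, 0.0)                 # (k, d)
    return out @ W_O                                     # (k, d)

rng = np.random.default_rng(0)
k, d = 16, 8  # illustrative token count and embedding width
x = rng.standard_normal((k, d))
W_I = rng.standard_normal((d, 1))
W_K, W_V, W_O = (rng.standard_normal((d, d)) for _ in range(3))

y = separable_self_attention(x, W_I, W_K, W_V, W_O)
print(y.shape)
```

Note that no k-by-k attention matrix is ever formed: the scores have shape (k, 1), so the attention-specific work grows linearly with the number of tokens.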

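Experimental results

Research questions

RQ1: Can self-attention be reformulated with linear complexity while maintaining performance on mobile vision tasks?
RQ2: Can separable self-attention keep accuracy competitive while significantly reducing latency on mobile devices?
RQ3: How does MobileViTv2 with separable self-attention compare with MobileViTv1 and other mobile vision models on classification, detection, and segmentation tasks?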

Key findings

Attention unit	Latency (ms)	Top-1 (%)
Self-attention in Transformer (Fig. 3(a))	9.9	78.4
Self-attention in Linformer (Fig. 3(b))	10.2	78.2
Separable self-attention (ours; Fig. 3(c))	3.4	78.1
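The trade-off in the table can be made explicit with a little arithmetic. The sketch below only re-derives ratios from the three rows above; the dictionary keys and variable names are illustrative, and no numbers beyond those in the table are used.

```python
# Latency (ms) and top-1 accuracy (%) copied from the table above.
rows = {
    "transformer": (9.9, 78.4),
    "linformer": (10.2, 78.2),
    "separable": (3.4, 78.1),
}

sep_latency, sep_top1 = rows["separable"]

# For each baseline: how many times slower it is than separable
# self-attention, and how many top-1 points it is ahead.
summary = {}
for name in ("transformer", "linformer"):
    latency, top1 = rows[name]
    summary[name] = (round(latency / sep_latency, 1), round(top1 - sep_top1, 1))

for name, (speedup, gap) in summary.items():
    print(f"{name}: separable is {speedup}x faster, {gap} points lower top-1")
```

On these numbers the separable unit is roughly three times faster than either baseline while giving up a few tenths of a point of top-1 accuracy.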

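MobileViTv2 runs about 3x faster than MobileViTv1 on a mobile device (3.2x per the abstract) with similar or better accuracy on ImageNet-1k.
On ImageNet-1k, the separable self-attention variant stays within about 0.1 to 0.3 points of the baselines' top-1 accuracy (78.1% versus 78.2% and 78.4%) and cuts attention-unit latency to 3.4 ms, from 9.9 to 10.2 ms for the baselines.
With the separable self-attention backbone, MobileViTv2 is competitive or better on MS-COCO object detection and on ADE20k/PASCAL VOC segmentation.
The experiments show that, compared with the Transformer and Linformer baselines on mobile hardware, separable self-attention lowers attention-related latency with a negligible loss in top-1 accuracy.
Across tasks, MobileViTv2 narrows the latency gap between CNNs and ViTs on mobile devices while maintaining or improving accuracy.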

This review was generated by AI and checked by a human editor.