QUICK REVIEW

[Paper Review] Separable Self-attention for Mobile Vision Transformers

Sachin Mehta, Mohammad Rastegari | arXiv (Cornell University) | 2022-06-06

Advanced Neural Network Applications | Citations: 182

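One-line summary

The paper introduces separable self-attention with linear O(k) complexity to replace the standard multi-head self-attention in MobileViT, achieving similar accuracy while speeding up inference on mobile devices.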

ABSTRACT

Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: https://github.com/apple/ml-cvnets

Motivation and objectives

Identify and address the latency bottleneck of multi-headed self-attention (MHA) in vision transformers on mobile devices.
Propose a separable self-attention mechanism with linear complexity and element-wise operations.
Integrate the separable self-attention into MobileViT to form MobileViTv2.
Demonstrate improved inference speed while maintaining or improving accuracy on ImageNet-1k, MS-COCO, and segmentation benchmarks.

Proposed method

Replace the MHA in MobileViT with separable self-attention that computes context scores with respect to a latent token L, reducing complexity from O(k^2) to O(k).
Compute context scores through a single latent projection (branch I, weights W_I) followed by a softmax over the k tokens to produce c_s, then obtain the context vector c_v as the c_s-weighted sum of the tokens projected with W_K.
Propagate context information to V via broadcasted element-wise multiplication with ReLU(xW_V) and a final linear projection W_O to produce output y.
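Describe separable self-attention as $y = \left(\sum\left(\sigma(xW_I) * xW_K\right) * \mathrm{ReLU}(xW_V)\right) W_O$, where $\sigma$ is the softmax that produces c_s and $*$ is broadcast element-wise multiplication, highlighting that only element-wise operations are needed.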
Integrate the separable self-attention into MobileViTv2 by replacing MHA in MobileViT, and explore model widths using a multiplier alpha in {0.5, 2.0}.
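
The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration written for this review, not the authors' implementation (which lives in the ml-cvnets repository): the helper names `matmul`, `softmax`, and `separable_self_attention` are hypothetical, matrices are plain lists of rows, and bias terms, batching, and normalization layers are assumed away for brevity.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """x: k tokens of width d (k x d); w_i: d x 1; w_k, w_v, w_o: d x d."""
    k, d = len(x), len(w_k[0])
    # Context scores c_s: one scalar per token, softmax over the k tokens.
    c_s = softmax([row[0] for row in matmul(x, w_i)])
    # Context vector c_v: score-weighted sum of the key projections (length d).
    keys = matmul(x, w_k)
    c_v = [sum(c_s[t] * keys[t][j] for t in range(k)) for j in range(d)]
    # Broadcast c_v over every token of ReLU(x W_V), element-wise.
    values = [[max(0.0, v) for v in row] for row in matmul(x, w_v)]
    gated = [[c_v[j] * row[j] for j in range(d)] for row in values]
    # Final linear projection W_O.
    return matmul(gated, w_o)


# Toy input (assumed values): k = 3 tokens of width d = 2.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
w_i = [[0.0], [0.0]]  # zero projection, so the softmax scores are uniform
y = separable_self_attention(x, w_i, identity, identity, identity)
```

With the zero `w_i` used here, every token receives the same score, so `c_v` reduces to the per-column mean of the tokens and each output row is that mean multiplied element-wise into the token. The sketch never forms a k x k attention matrix: the only per-token quantities are the k scalar scores and the k value rows, which is where the O(k) cost in the number of tokens comes from.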

Experimental results

Research questions

RQ1: Can self-attention be reformulated to linear complexity while preserving performance on mobile vision tasks?
RQ2: Does separable self-attention maintain competitive accuracy while significantly reducing latency on mobile devices?
RQ3: How does MobileViTv2 with separable self-attention compare to MobileViTv1 and other mobile vision models across classification, detection, and segmentation tasks?

Key results

Attention unit	Latency (ms)	Top-1 (%)
Self-attention in Transformer (Fig. 3(a))	9.9	78.4
Self-attention in Linformer (Fig. 3(b))	10.2	78.2
Separable self-attention (Ours; Fig. 3(c))	3.4	78.1
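
The trade-off in the table can be checked with a little arithmetic. The snippet below is only a convenience sketch: the numbers are copied from the three rows above, while the short dictionary keys and the variable names are made up for illustration.

```python
# Latency (ms) and top-1 accuracy (%) copied from the table above.
units = {
    "Transformer": (9.9, 78.4),
    "Linformer": (10.2, 78.2),
    "Separable": (3.4, 78.1),
}

sep_latency, sep_top1 = units["Separable"]
comparison = {}
for name, (latency, top1) in units.items():
    if name == "Separable":
        continue
    speedup = latency / sep_latency  # how many times slower the baseline is
    top1_gap = top1 - sep_top1       # accuracy given up, in percentage points
    comparison[name] = (speedup, top1_gap)
    print(f"{name}: {speedup:.2f}x the latency of separable, top-1 gap {top1_gap:.1f} points")
```

Read this way, the separable unit is roughly 2.9x faster than the Transformer unit and 3.0x faster than the Linformer unit, at a cost of 0.3 and 0.1 top-1 points respectively.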

According to the abstract, MobileViTv2 runs 3.2x faster than MobileViT on a mobile device and, with about three million parameters, reaches 75.6% top-1 accuracy on ImageNet, about 1% higher than MobileViT.
In the attention-unit comparison on ImageNet-1k, separable self-attention reaches 78.1% top-1, within 0.3 points of the Transformer unit (78.4%) and 0.1 points of the Linformer unit (78.2%), while cutting latency to 3.4 ms from 9.9 ms and 10.2 ms respectively.
MobileViTv2 delivers competitive or better performance in MS-COCO object detection and ADE20k/PASCAL VOC segmentation when using the separable self-attention backbone.
Experiments show separable self-attention reduces attention-related latency with negligible loss in Top-1 accuracy compared to Transformer/Linformer baselines on mobile hardware.
Across tasks, MobileViTv2 tightens the latency gap between CNNs and ViTs on mobile devices while preserving or improving accuracy.

This review was generated by AI and reviewed by a human editor.