[Paper Review] Separable Self-attention for Mobile Vision Transformers
The paper introduces separable self-attention with linear O(k) complexity as a replacement for standard multi-head self-attention in MobileViT, speeding up inference on mobile devices at similar accuracy.
Mobile vision transformers (MobileViT) can achieve state-of-the-art performance across several mobile vision tasks, including classification and detection. Though these models have fewer parameters, they have high latency as compared to convolutional neural network-based models. The main efficiency bottleneck in MobileViT is the multi-headed self-attention (MHA) in transformers, which requires $O(k^2)$ time complexity with respect to the number of tokens (or patches) $k$. Moreover, MHA requires costly operations (e.g., batch-wise matrix multiplication) for computing self-attention, impacting latency on resource-constrained devices. This paper introduces a separable self-attention method with linear complexity, i.e. $O(k)$. A simple yet effective characteristic of the proposed method is that it uses element-wise operations for computing self-attention, making it a good choice for resource-constrained devices. The improved model, MobileViTv2, is state-of-the-art on several mobile vision tasks, including ImageNet object classification and MS-COCO object detection. With about three million parameters, MobileViTv2 achieves a top-1 accuracy of 75.6% on the ImageNet dataset, outperforming MobileViT by about 1% while running $3.2\times$ faster on a mobile device. Our source code is available at: https://github.com/apple/ml-cvnets
Research Motivation and Objectives
- Motivate and address the latency bottleneck of multi-headed self-attention (MHA) in vision transformers for mobile devices.
- Propose a separable self-attention mechanism with linear complexity and element-wise operations.
- Integrate the separable self-attention into MobileViT to form MobileViTv2.
- Demonstrate improved inference speed while maintaining or improving accuracy on ImageNet-1k, MS-COCO, and segmentation benchmarks.
Proposed Method
- Replace the MHA in MobileViT with separable self-attention that computes context scores with respect to a latent token L, reducing complexity from O(k^2) to O(k).
- Compute context scores via a single latent projection I using W_I, followed by a softmax to produce c_s, then obtain a context vector c_v as a weighted sum of K-projected tokens with W_K.
- Propagate context information to V via broadcasted element-wise multiplication with ReLU(xW_V) and a final linear projection W_O to produce output y.
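- Describe separable self-attention as $y = \left( \sum \left( \sigma(xW_I) * xW_K \right) * \mathrm{ReLU}(xW_V) \right) W_O$, where $\sigma$ is the softmax over tokens, $*$ is broadcasted element-wise multiplication, and the sum runs over the $k$ tokens, highlighting element-wise operations.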
- Integrate the separable self-attention into MobileViTv2 by replacing MHA in MobileViT, and explore model widths using a multiplier alpha in {0.5, 2.0}.
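The steps in the list above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation (that lives in the ml-cvnets repository). It assumes a single image with no batch dimension, tokens stored as a list of `k` rows of length `d`, `W_I` given as a length-`d` vector, and square `d x d` matrices for `W_K`, `W_V`, and `W_O`. The helper names `matvec`, `softmax`, and `separable_self_attention` are hypothetical.

```python
import math


def matvec(row, W):
    """Multiply a row vector (length d_in) by a matrix W (d_in x d_out)."""
    return [sum(row[i] * W[i][j] for i in range(len(row)))
            for j in range(len(W[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def separable_self_attention(x, W_I, W_K, W_V, W_O):
    """Sketch of y = (sum(softmax(x W_I) * x W_K) * ReLU(x W_V)) W_O.

    x   : k tokens, each a list of d floats (assumed layout)
    W_I : length-d vector mapping each token to one scalar score
    W_K, W_V, W_O : d x d matrices (assumed square for simplicity)
    """
    k = len(x)
    # Context scores c_s: one scalar per token, normalized over the k tokens.
    c_s = softmax([sum(xi * wi for xi, wi in zip(tok, W_I)) for tok in x])
    # Context vector c_v: weighted sum of the key projections (length d).
    keys = [matvec(tok, W_K) for tok in x]
    d = len(keys[0])
    c_v = [sum(c_s[t] * keys[t][j] for t in range(k)) for j in range(d)]
    # Broadcast c_v onto every token's ReLU(x W_V), then project with W_O.
    out = []
    for tok in x:
        v = [max(0.0, a) for a in matvec(tok, W_V)]
        gated = [c * a for c, a in zip(c_v, v)]
        out.append(matvec(gated, W_O))
    return out
```

Every loop over tokens runs `k` times and no `k x k` score matrix is ever formed, which is where the O(k) cost in the number of tokens comes from. The only interaction between tokens is the single context vector `c_v` that is shared by all of them.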
Experimental Results
Research Questions
- RQ1: Can self-attention be reformulated to linear complexity while preserving performance on mobile vision tasks?
- RQ2: Does separable self-attention maintain competitive accuracy while significantly reducing latency on mobile devices?
- RQ3: How does MobileViTv2 with separable self-attention compare to MobileViTv1 and other mobile vision models across classification, detection, and segmentation tasks?
Key Results
| Attention unit | Latency (ms) | Top-1 (%) |
|---|---|---|
| Self-attention in Transformer (Fig. 3(a)) | 9.9 | 78.4 |
| Self-attention in Linformer (Fig. 3(b)) | 10.2 | 78.2 |
| Separable self-attention (Ours; Fig. 3(c)) | 3.4 | 78.1 |
- MobileViTv2 runs about 3.2x faster than MobileViTv1 on a mobile device; with about three million parameters it reaches 75.6% top-1 accuracy on ImageNet-1k, about 1% higher than MobileViT.
- In the attention-unit comparison, separable self-attention reaches 78.1% top-1, within 0.3 points of the Transformer (78.4%) and Linformer (78.2%) units, while cutting latency to 3.4 ms from 9.9 to 10.2 ms for the baselines.
- MobileViTv2 delivers competitive or better performance in MS-COCO object detection and ADE20k/PASCAL VOC segmentation when using the separable self-attention backbone.
- Experiments show separable self-attention reduces attention-related latency with negligible loss in Top-1 accuracy compared to Transformer/Linformer baselines on mobile hardware.
- Across tasks, MobileViTv2 tightens the latency gap between CNNs and ViTs on mobile devices while preserving or improving accuracy.
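As a quick sanity check on the latency claims, the ratios can be recomputed directly from the attention-unit table above. The snippet below is only arithmetic on the three reported rows; the dictionary layout and the `compare` helper are illustrative names, not part of the paper or its code.

```python
# Attention-unit numbers copied from the results table above.
RESULTS = {
    "transformer": {"latency_ms": 9.9, "top1": 78.4},
    "linformer": {"latency_ms": 10.2, "top1": 78.2},
    "separable": {"latency_ms": 3.4, "top1": 78.1},
}


def compare(baseline, candidate="separable"):
    """Return the latency speedup and top-1 drop of candidate vs. baseline."""
    b, c = RESULTS[baseline], RESULTS[candidate]
    return {
        "speedup": b["latency_ms"] / c["latency_ms"],
        "top1_drop": b["top1"] - c["top1"],
    }


for name in ("transformer", "linformer"):
    r = compare(name)
    print(f"{name}: {r['speedup']:.2f}x faster, "
          f"{r['top1_drop']:.1f} points lower top-1")
```

The separable unit comes out roughly 2.9x faster than the Transformer unit and 3.0x faster than the Linformer unit, at a cost of 0.3 and 0.1 top-1 points respectively. These are attention-unit figures; the 3.2x speedup quoted in the abstract refers to the full MobileViTv2 model.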
This review was generated by AI and reviewed by a human editor.