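[Paper Review] RMT: Retentive Networks Meet Vision Transformers
This paper introduces MaSA, a self-attention mechanism with Manhattan-distance-based spatial decay, and uses it to build RMT, a vision backbone with an explicit spatial prior. RMT models global information with linear complexity, reaches state-of-the-art accuracy on ImageNet, and performs strongly across detection and segmentation tasks.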
Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an attention decomposition form that adeptly adapts to explicit spatial prior is proposed, aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form, we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically, without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is available at https://github.com/qhfan/RMT
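Motivation and Objectives
- Argue that Vision Transformers (ViT) need an explicit spatial prior, both to improve data efficiency and to reduce the quadratic complexity of Self-Attention.
- Propose MaSA, a Manhattan-distance-based spatial decay mechanism integrated into Self-Attention.
- Develop a decomposed attention formulation that models global information with linear complexity without disrupting the spatial prior.
- Build a single backbone (RMT) from MaSA and demonstrate strong performance across classification, detection, and segmentation tasks.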
Proposed Method
- Extend RetNet's temporal decay to two-dimensional spatial decay using a Manhattan distance-based matrix D^{2d}.
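- Define MaSA as MaSA(X) = (Softmax(QK^T) ⊙ D^{2d}) V, where D^{2d}_{nm} = γ^{|x_n − x_m| + |y_n − y_m|} decays with the Manhattan distance between tokens n and m, introducing an explicit spatial prior into Self-Attention.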
- Introduce a decomposed MaSA (Attn_H and Attn_W) that applies one-dimensional bidirectional decay along horizontal and vertical axes to retain linear complexity while preserving the spatial prior.
- Add Local Context Enhancement (LCE) and Convolutional Position Encoding (CPE) to strengthen local representations and positional information.
- Construct a four-stage vision backbone (RMT) using decomposed MaSA in the first three stages and MaSA in the last stage, with a convolutional stem and standardized training settings.
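The attention forms listed above can be sketched in NumPy. This is a minimal, single-head illustration and not the authors' implementation: the function names, the single scalar `gamma=0.9`, the row-major token order, and the omission of any score scaling, LCE, or CPE are all assumptions made for this sketch. `masa` applies the full two-dimensional decay, while `masa_decomposed` attends along the width of each row and then along the height of each column, each with its own one-dimensional decay.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def decay_1d(length, gamma):
    """Bidirectional 1D decay: D[n, m] = gamma ** |n - m|."""
    idx = np.arange(length)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def manhattan_decay(height, width, gamma):
    """2D decay: D[n, m] = gamma ** (|x_n - x_m| + |y_n - y_m|).

    Tokens are assumed to be flattened in row-major order.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ys, xs = ys.reshape(-1), xs.reshape(-1)
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return gamma ** dist

def masa(q, k, v, height, width, gamma=0.9):
    """Full form: (Softmax(Q K^T) * D_2d) V, with q, k, v of shape (height*width, dim)."""
    attn = softmax(q @ k.T) * manhattan_decay(height, width, gamma)
    return attn @ v

def masa_decomposed(q, k, v, gamma=0.9):
    """Decomposed form for q, k, v of shape (H, W, dim) (hypothetical layout).

    Attn_W mixes tokens within each row, Attn_H mixes tokens within each column.
    """
    H, W, _ = q.shape
    # (H, W, W): per-row scores along the width axis, decayed by |x_n - x_m|
    attn_w = softmax(np.einsum("hwd,hvd->hwv", q, k)) * decay_1d(W, gamma)
    # (W, H, H): per-column scores along the height axis, decayed by |y_n - y_m|
    attn_h = softmax(np.einsum("hwd,gwd->whg", q, k)) * decay_1d(H, gamma)
    out = np.einsum("hwv,hvd->hwd", attn_w, v)      # aggregate along width
    return np.einsum("whg,gwd->hwd", attn_h, out)   # then aggregate along height
```

Setting `gamma=1.0` turns every decay entry into 1, so `masa` reduces to plain softmax attention in this sketch; smaller values of `gamma` suppress contributions from tokens that are farther away on the grid.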
Experimental Results
Research Questions
- RQ1: Can explicit spatial priors be integrated into Vision Transformers to improve accuracy without sacrificing efficiency?
- RQ2: How can a two-dimensional spatial decay mechanism be designed and integrated with Self-Attention while maintaining linear complexity?
- RQ3: Does decomposing MaSA into horizontal and vertical attention preserve the receptive field and prior information while reducing compute?
- RQ4: Do MaSA-based backbones (RMT) outperform state-of-the-art ViT backbones on ImageNet and downstream vision tasks (detection, segmentation) without extra data?
- RQ5: What is the impact of auxiliary components (LCE, CPE, convolutional stem) on overall performance?
Key Results
| Model | Params (M) | FLOPs (G) | Top-1 acc (%) |
|---|---|---|---|
| RMT-S | 27 | 4.5 | 84.1 |
| RMT-S* | 27 | 4.5 | 84.8 |
| RMT-B | 54 | 9.7 | 85.0 |
| RMT-B* | 55 | 9.7 | 85.6 |
| RMT-L | 95 | 18.2 | 85.5 |
| RMT-L* | 96 | 18.2 | 86.1 |
| RMT-T | 14 | 2.5 | 82.4 |
- RMT achieves state-of-the-art or competitive top-1 accuracy on ImageNet-1K across model scales with favorable FLOPs (e.g., RMT-S: 84.1% Top-1 at 4.5 GFLOPs; RMT-S*: 84.8%; RMT-L*: 86.1%).
- RMT-L outperforms MaxViT-B while using fewer FLOPs, and RMT-B* reaches 85.6% Top-1 accuracy.
- In downstream tasks, RMT attains 54.5 box AP and 47.2 mask AP on COCO, and 52.8 mIoU on ADE20K, demonstrating strong capabilities in object detection, instance segmentation, and semantic segmentation.
- The decomposed MaSA (Attn_H and Attn_W) preserves the spatial prior while achieving linear complexity for global modeling.
- Ablation studies show MaSA improves performance over vanilla attention by about 0.8% in classification, and LCE/CPE/stem contribute incremental gains.
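One way to see why the decomposition can keep the spatial prior is that the Manhattan decay factorizes: γ^{|Δx| + |Δy|} = γ^{|Δx|} · γ^{|Δy|}. The sketch below checks this numerically with a Kronecker product and also counts attention-score entries for the full and decomposed forms. The helper names and the row-major token order are assumptions of this sketch, and the counts describe only the score tensors in this illustration, not the paper's FLOP accounting. Only the decay factorizes exactly; the softmax terms do not, so the decomposed attention is not numerically identical to full MaSA.

```python
import numpy as np

def decay_1d(length, gamma):
    """Bidirectional 1D decay: D[n, m] = gamma ** |n - m|."""
    idx = np.arange(length)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def manhattan_decay(height, width, gamma):
    """2D decay over a row-major flattened height x width grid."""
    ys, xs = np.divmod(np.arange(height * width), width)
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return gamma ** dist

def decay_is_separable(height, width, gamma):
    """Check that D_2d equals kron(D_H, D_W) under row-major token order."""
    full = manhattan_decay(height, width, gamma)
    factored = np.kron(decay_1d(height, gamma), decay_1d(width, gamma))
    return bool(np.allclose(full, factored))

def score_entries(height, width):
    """Number of attention-score entries: (full, decomposed)."""
    n = height * width
    full = n * n                          # one (HW x HW) matrix
    decomposed = n * width + n * height   # H matrices of W x W plus W matrices of H x H
    return full, decomposed
```

For a 14 × 14 token grid, `score_entries(14, 14)` returns `(38416, 5488)`, and `decay_is_separable(14, 14, 0.9)` evaluates to `True`.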
This review was generated by AI and reviewed by a human editor.