QUICK REVIEW

[Paper Review] RMT: Retentive Networks Meet Vision Transformers

Qihang Fan, Huaibo Huang|arXiv (Cornell University)|Sep 20, 2023

Advanced Neural Network Applications | Cited by 15

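One-Sentence Summary

This paper proposes MaSA, a self-attention mechanism with spatial decay based on the Manhattan distance, and uses it to build RMT, a vision backbone with an explicit spatial prior and linear complexity that achieves state-of-the-art results on ImageNet and performs strongly on detection and segmentation.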

ABSTRACT

Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an attention decomposition form that adeptly adapts to explicit spatial prior is proposed, aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form, we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically, without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is available at https://github.com/qhfan/RMT

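Motivation and Objectives

Motivation: Vision Transformers (ViT) lack an explicit spatial prior, which would improve data efficiency, and their self-attention has quadratic complexity, which needs to be reduced.
Propose MaSA, a spatial decay mechanism based on the Manhattan distance that is integrated into self-attention.
Develop a decomposed attention formulation that models global information with linear complexity without disrupting the spatial prior.
Build a unified backbone (RMT) on MaSA and demonstrate strong performance on classification, detection, and segmentation.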

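Proposed Method

Extend RetNet's temporal decay to a two-dimensional spatial decay, using a matrix D^{2d} based on the Manhattan distance.
Define MaSA as (Softmax(QK^T) ⊙ D^{2d}) V, which introduces an explicit spatial prior into self-attention.
Introduce a decomposed MaSA (Attn_H and Attn_W) that applies one-dimensional bidirectional decay along the horizontal and vertical axes, preserving the spatial prior while keeping linear complexity.
Add Local Context Enhancement (LCE) and Convolutional Position Encoding (CPE) to strengthen local representation and positional information.
Build a four-stage vision backbone (RMT) in which the first three stages use the decomposed MaSA and the last stage uses the original MaSA, together with a convolutional stem and standard training settings.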
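
The MaSA definition above can be illustrated with a minimal NumPy sketch. It assumes that each entry of D^{2d} is gamma raised to the Manhattan distance between two token positions on an H x W grid, with a single scalar gamma, one head, and no batching or query scaling. The function names (`manhattan_decay`, `masa`) and the value gamma = 0.9 are illustrative choices, not taken from the authors' code.

```python
import numpy as np

def manhattan_decay(height, width, gamma):
    """Decay matrix with D[n, m] = gamma ** (|y_n - y_m| + |x_n - x_m|).

    Tokens are the cells of a height x width grid, flattened row by row.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)            # (N, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # (N, N)
    return gamma ** dist

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masa(q, k, v, decay):
    """(Softmax(Q K^T) * D) V for one head; q, k, v have shape (N, dim)."""
    attn = softmax(q @ k.T) * decay  # element-wise product with the decay
    return attn @ v

# Toy example: a 4 x 5 grid of tokens with 8 channels (hypothetical sizes).
rng = np.random.default_rng(0)
H, W, dim = 4, 5, 8
q, k, v = (rng.standard_normal((H * W, dim)) for _ in range(3))
D = manhattan_decay(H, W, gamma=0.9)  # gamma value is an assumption
out = masa(q, k, v, D)                # shape (H * W, dim)
```

Because the decay multiplies the attention weights after the softmax, a token's own weight is left unchanged (distance 0) while weights on distant tokens shrink geometrically with their Manhattan distance.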

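Experimental Results

Research Questions

RQ1: Can an explicit spatial prior be integrated into Vision Transformers to improve accuracy without sacrificing efficiency?
RQ2: How can a two-dimensional spatial decay mechanism be designed and integrated with self-attention while keeping linear complexity?
RQ3: Does decomposing MaSA into horizontal and vertical attention reduce computation while preserving the receptive field and the spatial prior?
RQ4: Does the MaSA-based backbone (RMT) outperform state-of-the-art ViT backbones on ImageNet and on downstream vision tasks (detection, segmentation) without extra data?
RQ5: How do the auxiliary components (LCE, CPE, convolutional stem) affect overall performance?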

Key Findings

Model	Params (M)	FLOPs (G)	Top-1 acc (%)
RMT-S	27	4.5	84.1
RMT-S*	27	4.5	84.8
RMT-B	54	9.7	85.0
RMT-B*	55	9.7	85.6
RMT-L	95	18.2	85.5
RMT-L*	96	18.2	86.1
RMT-T	14	2.5	82.4

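RMT achieves state-of-the-art or competitive top-1 accuracy on ImageNet-1K across model sizes at modest FLOPs (for example, RMT-S reaches 84.1% top-1 at 4.5 GFLOPs, RMT-S* 84.8%, and RMT-L* 86.1%).
RMT-L outperforms MaxViT-B while using fewer FLOPs, and RMT-B* reaches 85.6% top-1 accuracy.
On downstream tasks, RMT achieves 54.5 box AP and 47.2 mask AP on COCO and 52.8 mIoU on ADE20K, showing strong performance on object detection, instance segmentation, and semantic segmentation.
The decomposed MaSA (Attn_H and Attn_W) models global information with linear complexity while preserving the spatial prior.
Ablation studies show that MaSA improves classification accuracy by about 0.8% over vanilla attention, with LCE, CPE, and the convolutional stem contributing further gains.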
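
The claim that decomposition preserves the spatial prior can be checked numerically with a small NumPy sketch. It assumes the decomposed form attends along the width of each row (Attn_W) and then along the height of each column (Attn_H), each with a one-dimensional decay gamma ** |i - j|. The names `decay_1d` and `decomposed_masa`, the order of the two passes, and the tensor layout (H, W, C) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def decay_1d(length, gamma):
    """Bidirectional 1D decay: D[i, j] = gamma ** |i - j|."""
    idx = np.arange(length)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decomposed_masa(q, k, v, gamma):
    """Axis-wise attention with 1D decay; q, k, v have shape (H, W, C)."""
    H, W, _ = q.shape
    # Within each row h: query column w attends to key column u.
    attn_w = softmax(np.einsum("hwc,huc->hwu", q, k)) * decay_1d(W, gamma)
    # Within each column w: query row h attends to key row g.
    attn_h = softmax(np.einsum("hwc,gwc->whg", q, k)) * decay_1d(H, gamma)
    mixed = np.einsum("hwu,huc->hwc", attn_w, v)    # mix along the width
    return np.einsum("whg,gwc->hwc", attn_h, mixed)  # mix along the height

rng = np.random.default_rng(0)
H, W, C = 3, 4, 8  # hypothetical toy sizes
q, k, v = (rng.standard_normal((H, W, C)) for _ in range(3))
out = decomposed_masa(q, k, v, gamma=0.9)

# The product of the two 1D decays equals the 2D Manhattan decay, since
# gamma ** (|dy| + |dx|) == gamma ** |dy| * gamma ** |dx|.
ys, xs = np.divmod(np.arange(H * W), W)  # row-major token coordinates
manhattan = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
full_decay = 0.9 ** manhattan
separable = np.kron(decay_1d(H, 0.9), decay_1d(W, 0.9))
print(np.allclose(full_decay, separable))
```

In this sketch the two attention maps hold H·W·W and W·H·H entries, compared with (H·W)^2 entries for the full attention map over all token pairs.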

This review was generated by AI and reviewed by a human editor.