QUICK REVIEW

[Paper Review] RMT: Retentive Networks Meet Vision Transformers

Qihang Fan, Huaibo Huang|arXiv (Cornell University)|Sep 20, 2023

Advanced Neural Network Applications | Cited by 15

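One-line summary

This paper introduces MaSA, a Manhattan-distance-based self-attention mechanism with spatial decay, and builds RMT on top of it. RMT is a vision backbone that explicitly incorporates a spatial prior at linear computational complexity; it achieves state-of-the-art results on ImageNet and performs strongly on detection and segmentation tasks.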

ABSTRACT

Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However, the core component of ViT, Self-Attention, lacks explicit spatial priors and bears a quadratic computational complexity, thereby constraining the applicability of ViT. To alleviate these issues, we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP, and propose RMT, a strong vision backbone with explicit spatial prior for general purposes. Specifically, we extend the RetNet's temporal decay mechanism to the spatial domain, and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally, an attention decomposition form that adeptly adapts to explicit spatial prior is proposed, aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form, we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically, without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is available at https://github.com/qhfan/RMT

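Motivation and objectives

Motivate the need for an explicit spatial prior in Vision Transformers (ViT), with the aim of improving data efficiency and reducing the quadratic complexity of Self-Attention.
Propose MaSA, a Manhattan-distance-based spatial decay mechanism integrated into Self-Attention.
Develop a decomposed attention formulation that models global information with linear complexity without disrupting the spatial prior.
Assemble a unified backbone (RMT) built on MaSA and demonstrate strong performance on classification, detection, and segmentation tasks.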

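Proposed method

Extend RetNet's temporal decay to a two-dimensional spatial decay by introducing a Manhattan-distance-based matrix D^{2d}.
Define MaSA as MaSA(X) = (Softmax(QK^T) ⊙ D^{2d})V, which brings an explicit spatial prior into Self-Attention.
Introduce a decomposed MaSA (Attn_H and Attn_W) that applies one-dimensional decay separately along the horizontal and vertical axes, keeping the spatial prior while maintaining linear complexity.
Add Local Context Enhancement (LCE) and Convolutional Position Encoding (CPE) to strengthen local representations and positional information.
Build a four-stage vision backbone (RMT) that uses decomposed MaSA in the first three stages and full MaSA in the final stage, together with a convolutional stem and a standardized training setup.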
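
The MaSA definitions listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`manhattan_decay`, `decay_1d`, `masa`, `decomposed_masa`) are hypothetical, a single scalar decay rate `gamma` is assumed, there is only one attention head, and no 1/sqrt(d) scaling is applied because the formula as stated here has none. The two-step application of the row and column attention maps is likewise an assumption about how Attn_H and Attn_W are combined.

```python
import numpy as np

def manhattan_decay(height, width, gamma):
    """D[n, m] = gamma ** (|x_n - x_m| + |y_n - y_m|) for tokens on an H x W grid."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ys, xs = ys.reshape(-1), xs.reshape(-1)  # row-major token order
    dist = np.abs(ys[:, None] - ys[None, :]) + np.abs(xs[:, None] - xs[None, :])
    return gamma ** dist

def decay_1d(length, gamma):
    """One-dimensional decay: D[i, j] = gamma ** |i - j|."""
    idx = np.arange(length)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masa(q, k, v, decay):
    """(Softmax(Q K^T) * D) V on flattened tokens; q, k, v have shape (N, d)."""
    return (softmax(q @ k.T) * decay) @ v

def decomposed_masa(q, k, v, gamma):
    """Row-wise then column-wise attention; q, k, v have shape (H, W, d)."""
    height, width, _ = q.shape
    # Attn_W: one (W, W) map per row, decayed along the horizontal axis.
    attn_w = softmax(np.einsum("hwd,hud->hwu", q, k)) * decay_1d(width, gamma)
    # Attn_H: one (H, H) map per column, decayed along the vertical axis.
    attn_h = softmax(np.einsum("hwd,gwd->whg", q, k)) * decay_1d(height, gamma)
    out = np.einsum("hwu,hud->hwd", attn_w, v)     # mix values along each row
    return np.einsum("whg,gwd->hwd", attn_h, out)  # then along each column

# Usage on a small, illustrative 2 x 3 grid with 4 channels.
rng = np.random.default_rng(0)
H, W, d, gamma = 2, 3, 4, 0.5
q, k, v = (rng.standard_normal((H, W, d)) for _ in range(3))
full = masa(q.reshape(-1, d), k.reshape(-1, d), v.reshape(-1, d),
            manhattan_decay(H, W, gamma))
split = decomposed_masa(q, k, v, gamma)
print(full.shape, split.shape)
```

In this sketch the product of the two one-dimensional decay factors equals the Manhattan decay, since gamma^|Δy| · gamma^|Δx| = gamma^(|Δy| + |Δx|). That is one way to see why splitting attention along the two axes can leave the spatial prior intact.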

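Experimental results

Research questions

RQ1: Can explicitly incorporating a spatial prior into Vision Transformers improve accuracy without sacrificing efficiency?
RQ2: Can a two-dimensional spatial decay mechanism be designed and integrated into Self-Attention while maintaining linear complexity?
RQ3: Can decomposing MaSA into horizontal and vertical attention reduce computation while preserving the receptive field and the spatial prior?
RQ4: Can the MaSA-based backbone (RMT) outperform state-of-the-art ViT backbones on ImageNet and on downstream detection and segmentation tasks without extra training data?
RQ5: How much do the auxiliary components (LCE, CPE, convolutional stem) contribute to overall performance?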

Key findings

Model	Params (M)	FLOPs (G)	Top-1 acc (%)
RMT-S	27	4.5	84.1
RMT-S*	27	4.5	84.8
RMT-B	54	9.7	85.0
RMT-B*	55	9.7	85.6
RMT-L	95	18.2	85.5
RMT-L*	96	18.2	86.1
RMT-T	14	2.5	82.4

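Across model scales, RMT achieves state-of-the-art or competitive top-1 accuracy on ImageNet-1K at favorable FLOPs (e.g., RMT-S reaches 84.1% top-1 at 4.5 GFLOPs; RMT-S* reaches 84.8%; RMT-L* reaches 86.1%).
RMT-L outperforms MaxViT-B with fewer FLOPs, and RMT-B* reaches 85.6% top-1 accuracy.
On downstream tasks, RMT achieves 54.5 box AP and 47.2 mask AP on COCO and 52.8 mIoU on ADE20K, showing strong capability in object detection, instance segmentation, and semantic segmentation.
Decomposed MaSA (Attn_H and Attn_W) preserves the spatial prior while achieving linear complexity for global modeling.
In ablation studies, MaSA improves classification accuracy by about 0.8% over vanilla attention, and LCE, CPE, and the convolutional stem provide additional gains.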
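
To give a rough sense of the saving from decomposing attention along the two axes, the sketch below counts attention-map entries for a full N x N map versus per-row and per-column maps. The helper name `attention_entries` and the grid sizes are illustrative assumptions, not values taken from the paper, and the count covers only the attention maps; it is not a full FLOP analysis.

```python
def attention_entries(height, width):
    """Return (full, decomposed) attention-map entry counts for an H x W token grid."""
    n_tokens = height * width
    full = n_tokens * n_tokens                # one (N, N) map over all token pairs
    decomposed = n_tokens * (height + width)  # H maps of (W, W) plus W maps of (H, H)
    return full, decomposed

for side in (7, 14, 28, 56):  # illustrative grid sizes (assumption)
    full, dec = attention_entries(side, side)
    print(f"{side}x{side}: full={full:,} decomposed={dec:,} ratio={full / dec:.1f}")
```

Under these assumptions the gap widens as resolution grows, which suggests the decomposed form matters most in the early, high-resolution stages.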

This review was generated by AI and checked by a human editor.