QUICK REVIEW

[Paper Review] Lightweight Vision Transformer with Bidirectional Interaction

Qihang Fan, Huaibo Huang|arXiv (Cornell University)|Jun 1, 2023

Visual Attention and Saliency Detection | Citations: 13

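One-Line Summary

The paper proposes Fully Adaptive Self-Attention (FASA), which models local and global features together with the bidirectional interaction between them. The result is FAT, a family of lightweight backbones that achieves strong performance on ImageNet, COCO, and ADE20K while keeping parameters and FLOPs low.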

ABSTRACT

Recent advancements in vision backbones have significantly improved their performance by simultaneously modeling images' local and global contexts. However, the bidirectional interaction between these two contexts has not been well explored and exploited, which is important in the human visual system. This paper proposes a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information as well as the bidirectional interaction between them in context-aware ways. Specifically, FASA employs self-modulated convolutions to adaptively extract local representation while utilizing self-attention in down-sampled space to extract global representation. Subsequently, it conducts a bidirectional adaptation process between local and global representation to model their interaction. In addition, we introduce a fine-grained downsampling strategy to enhance the down-sampled self-attention mechanism for finer-grained global perception capability. Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family. Extensive experiments on multiple vision tasks demonstrate that FAT achieves impressive performance. Notably, FAT accomplishes a 77.6% accuracy on ImageNet-1K using only 4.5M parameters and 0.7G FLOPs, which surpasses the most advanced ConvNets and Transformers with similar model size and computational costs. Moreover, our model exhibits faster speed on modern GPU compared to other models. Code will be available at https://github.com/qhfan/FAT.

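Motivation and Objectives

Motivate the need for efficient vision backbones that jointly model local and global information.
Introduce the Fully Adaptive Self-Attention (FASA) module to capture local and global representations and their bidirectional interaction.
Develop the Fully Adaptive Transformer (FAT) as a family of lightweight backbones for classification, detection, and segmentation.
Improve global perception through fine-grained downsampling in self-attention, preserving detail while maintaining efficiency.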

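Proposed Method

FASA consists of three CAFA-based components: Global Adaptive Aggregation, Local Adaptive Aggregation, and Bidirectional Adaptive Interaction.
Global aggregation uses fine-grained downsampling to improve global perception without excessive cost.
FASA is integrated into a hierarchical FAT backbone with a convolutional stem, CPE, ConvFFN, and shortcut connections.
FAT is trained and evaluated on ImageNet-1K classification, COCO object detection/instance segmentation, and ADE20K semantic segmentation.
Ablations examine the effectiveness of bidirectional interaction, fine-grained downsampling, and positional encoding.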
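
To make the three components listed above more concrete, the following is a toy 1D sketch in pure Python. It is not the authors' implementation: the fixed smoothing kernel, the sigmoid gating, the average-pooling downsampler, and all function names (`local_branch`, `global_branch`, `bidirectional_interaction`, `toy_fasa`) are assumptions chosen only to illustrate the data flow of a local branch, a down-sampled global branch, and a two-way modulation between them.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def local_branch(x, kernel=(0.25, 0.5, 0.25)):
    """Zero-padded 1D convolution whose output is gated by the input itself.

    Assumption: the gating stands in for a "self-modulated" convolution.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    conv = [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(x))
    ]
    return [c * sigmoid(t) for c, t in zip(conv, x)]


def global_branch(x, pool=2):
    """Each token attends over an average-pooled (down-sampled) copy of x."""
    pooled = []
    for i in range(0, len(x), pool):
        window = x[i:i + pool]
        pooled.append(sum(window) / len(window))
    out = []
    for q in x:
        scores = [q * k for k in pooled]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, pooled)) / z)
    return out


def bidirectional_interaction(local, global_):
    """Assumed gating form: each branch modulates the other, then both are summed."""
    local_mod = [l * sigmoid(g) for l, g in zip(local, global_)]
    global_mod = [g * sigmoid(l) for l, g in zip(local, global_)]
    return [a + b for a, b in zip(local_mod, global_mod)]


def toy_fasa(x, pool=2):
    return bidirectional_interaction(local_branch(x), global_branch(x, pool))


tokens = [0.1, 0.4, -0.2, 0.9, 0.3, -0.5, 0.0, 0.7]
fused = toy_fasa(tokens)
```

The output has one value per input token, which mirrors the idea that the global branch works in a down-sampled space while the fused result keeps the original resolution. A real module would operate on multi-channel 2D feature maps with learned weights.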

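Experimental Results

Research Questions

RQ1: Can bidirectional interaction between local and global features improve the performance of lightweight Vision Transformers without substantially increasing parameters or FLOPs?
RQ2: Does a fine-grained downsampling strategy in self-attention preserve global information better than large-stride downsampling in lightweight backbones?
RQ3: How does FAT compare with state-of-the-art lightweight backbones in accuracy and efficiency on ImageNet-1K, COCO, and ADE20K?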

Key Findings

Model	Input	Params (M)	FLOPs (G)	Throughput (img/s)	Top-1 (%)
FAT-B0	224^2	4.5	0.7	1932	77.6
FAT-B1	224^2	7.8	1.2	1452	80.1
FAT-B2	224^2	13.5	2.0	1064	81.9
FAT-B3	224^2	29.0	4.4	474	83.6
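
One way to read the table is to ask how much top-1 accuracy each additional GFLOP buys when moving to the next larger variant. The short script below computes that from the table values only; the helper name `marginal_gains` and the choice of metric are illustrative assumptions, not something reported in the paper.

```python
# (model, params in M, FLOPs in G, throughput in img/s, top-1 in %), copied from the table
ROWS = [
    ("FAT-B0", 4.5, 0.7, 1932, 77.6),
    ("FAT-B1", 7.8, 1.2, 1452, 80.1),
    ("FAT-B2", 13.5, 2.0, 1064, 81.9),
    ("FAT-B3", 29.0, 4.4, 474, 83.6),
]


def marginal_gains(rows):
    """Top-1 points gained per extra GFLOP between consecutive variants."""
    gains = []
    for prev, cur in zip(rows, rows[1:]):
        d_flops = cur[2] - prev[2]
        d_acc = cur[4] - prev[4]
        gains.append((f"{prev[0]} -> {cur[0]}", round(d_acc / d_flops, 2)))
    return gains


gains = marginal_gains(ROWS)
for step, points_per_gflop in gains:
    print(f"{step}: {points_per_gflop} top-1 points per extra GFLOP")
```

On these numbers the gain per extra GFLOP shrinks at each step (roughly 5.0, 2.25, and 0.71 points), the usual diminishing-returns pattern when scaling a backbone family.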

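FAT-B0 reaches 77.6% top-1 accuracy on ImageNet-1K with 4.5M parameters and 0.7 GFLOPs.
FAT-B1, FAT-B2, and FAT-B3 reach state-of-the-art levels among lightweight backbones at comparable cost; FAT-B3, for example, reaches 83.6% top-1 on ImageNet-1K.
On ADE20K, FAT-B1, FAT-B2, and FAT-B3 improve mIoU over competing lightweight backbones (e.g., FAT-B1 by +1.5 mIoU over EdgeViT-XS and FAT-B3 by +0.7 mIoU over Shunted-S).
On COCO object detection/instance segmentation, FAT backbones outperform the compared backbones in both the RetinaNet and Mask R-CNN settings.
Ablations confirm that bidirectional adaptive interaction outperforms simple fusion baselines and that fine-grained downsampling outperforms non-overlapping pooling/downsampling variants.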
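
The contrast between fine-grained and non-overlapping downsampling can be illustrated with standard convolution arithmetic. The sketch below compares a single non-overlapping stride-8 layer with three chained 3x3 stride-2 layers on a 56x56 map; these layer configurations and the 56x56 input are assumptions picked for illustration, not the settings used in FAT.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a strided convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def chain_summary(size, layers):
    """Return (output size, receptive field, total stride) for a chain of
    (kernel, stride, padding) layers applied along one spatial axis."""
    rf, jump = 1, 1
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
        rf += (kernel - 1) * jump
        jump *= stride
    return size, rf, jump


def attention_pairs(query_side, kv_side):
    """Number of query/key pairs when queries stay at full resolution."""
    return (query_side ** 2) * (kv_side ** 2)


coarse = chain_summary(56, [(8, 8, 0)])    # one non-overlapping stride-8 layer
fine = chain_summary(56, [(3, 2, 1)] * 3)  # three overlapping stride-2 layers

full_cost = attention_pairs(56, 56)
reduced_cost = attention_pairs(56, fine[0])
```

Both chains reduce the map to 7x7, so the attention cost is the same (64 times fewer query/key pairs than full attention here). The chained version has a receptive field of 15 against a total stride of 8, so neighbouring output tokens see overlapping input regions, whereas the single-layer version has a receptive field equal to its stride and therefore no overlap.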


This review was generated by AI and checked by a human editor.