QUICK REVIEW

[Paper Review] EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Xiaohuan Pei, Tao Huang|arXiv (Cornell University)|Mar 15, 2024

Advanced Vision and Imaging|Cited by 12

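One-Sentence Summary

EfficientVMamba introduces an atrous-based selective scan strategy (ES2D) and a dual-path EVSS block that combines global state space modeling with local convolution, reducing FLOPs while maintaining competitive accuracy on vision tasks.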

ABSTRACT

Prior efforts in light-weight model development mainly centered on CNN and Transformer-based designs yet faced persistent challenges. CNNs adept at local feature extraction compromise resolution, while Transformers offer global reach but escalate computational demands to $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs), such as Mamba, have shown outstanding performance and competitiveness in various tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work proposes to explore the potential of visual state space models in light-weight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, our EfficientVMamba integrates an atrous-based selective scan approach by efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates the model performance. Experimental results show that EfficientVMamba scales down the computational complexity while yielding competitive results across a variety of vision tasks. For example, our EfficientVMamba-S with 1.3G FLOPs improves Vim-Ti with 1.5G FLOPs by a large margin of 5.6% accuracy on ImageNet. Code is available at https://github.com/TerryPei/EfficientVMamba

Motivation and Objectives

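Propose a lightweight vision model that preserves global context without high computational cost.
Propose ES2D, which reduces scan complexity while keeping a global receptive field.
Introduce the EVSS block, which fuses global state space representations with local convolution through SE fusion.
Examine inverted insertion to optimize block placement across stages.
Demonstrate effectiveness across image classification, object detection, and semantic segmentation.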

Proposed Method

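Introduce an atrous-based selective scan with skip sampling (ES2D), reducing the number of scanned tokens per sequence from N to N/p^2, where p is the skip-sampling step.
Develop the Efficient Visual State Space (EVSS) block, which fuses ES2D-based global features with a 3×3 convolution branch and SE recalibration.
After SE, fuse the global and local paths by element-wise addition: X^{l+1} = SE(ES2D(X^l)) + SE(Conv(X^l)).
Adopt inverted insertion: place EVSS blocks in the early stages for global representation and InRes (inverted residual) blocks in the later stages for local features.
Provide three model variants (EfficientVMamba-T, -S, -B) with progressively larger FLOPs and parameter counts.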
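
The skip-sampling step behind ES2D can be illustrated with a small pure-Python sketch. It only shows how a grid of N tokens splits into p^2 interleaved sub-grids of N/p^2 tokens each and how the results are written back to their original positions. The helper names (skip_sample, merge, placeholder_scan, es2d_sketch) are hypothetical, and the running sum is merely a stand-in for the learned selective scan, which is not reproduced here. The sketch also assumes a single-channel grid.

```python
# Sketch of atrous (skip-sampling) partitioning as described for ES2D.
# Assumptions: helper names, the 4x4 toy grid, and the placeholder scan are
# illustrative only; the real selective scan is a learned SSM, not shown here.

def skip_sample(x, p):
    """Split an H x W grid into p*p sub-grids by taking every p-th row and column."""
    return {(m, n): [row[n::p] for row in x[m::p]]
            for m in range(p) for n in range(p)}

def merge(patches, height, width, p):
    """Write each sub-grid back to the positions it was sampled from."""
    out = [[None] * width for _ in range(height)]
    for (m, n), patch in patches.items():
        for i, row in enumerate(patch):
            for j, value in enumerate(row):
                out[m + i * p][n + j * p] = value
    return out

def placeholder_scan(patch):
    """Stand-in for the selective scan: a running sum over the flattened patch."""
    flat = [v for row in patch for v in row]
    total, scanned = 0, []
    for v in flat:
        total += v
        scanned.append(total)
    w = len(patch[0])
    return [scanned[k:k + w] for k in range(0, len(scanned), w)]

def es2d_sketch(x, p=2, scan=placeholder_scan):
    """Skip-sample, scan each sub-grid independently, then merge the results."""
    height, width = len(x), len(x[0])
    patches = skip_sample(x, p)
    scanned = {key: scan(patch) for key, patch in patches.items()}
    return merge(scanned, height, width, p)

H, W, P = 4, 4, 2
grid = [[r * W + c for c in range(W)] for r in range(H)]
patches = skip_sample(grid, P)
tokens_per_scan = len(patches[(0, 0)]) * len(patches[(0, 0)][0])
print(len(patches), "sub-grids of", tokens_per_scan, "tokens each")
print(es2d_sketch(grid))
```

Because the sub-grids interleave across the whole map, each one still spans the full spatial extent, which is how the method keeps a global receptive field while each scanned sequence becomes p^2 times shorter.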

Experimental Results

Research Questions

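RQ1: Can ES2D reduce the computational cost of global scanning in vision tasks while preserving global context?
RQ2: Does combining the global ES2D path with a local convolution branch improve accuracy under tight resource constraints?
RQ3: Is inverted residual insertion advantageous when combining SSM-based blocks with CNN blocks in lightweight models?
RQ4: How do the EfficientVMamba variants perform on ImageNet classification, COCO object detection, and ADE20K semantic segmentation compared with prior lightweight backbones?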

Key Findings

EfficientVMamba-T/S/B achieve competitive ImageNet accuracy at low FLOPs (0.8/1.3/4.0 GFLOPs, respectively).
EfficientVMamba-S reaches 78.7% ImageNet top-1 accuracy at 1.3 GFLOPs, surpassing several larger backbones.
EfficientVMamba-B reaches 81.8% ImageNet top-1 accuracy with 4.0 GFLOPs and 33M parameters.
In COCO RetinaNet experiments, EfficientVMamba-T achieves 37.5 AP and EfficientVMamba-B achieves 42.8 AP, with fewer parameters than several baselines.
On ADE20K semantic segmentation, the EfficientVMamba family outperforms heavier models and shows competitive mIoU scores (e.g., 46.5% or higher in single-scale (SS) testing).
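Ablations show that ES2D reduces FLOPs while maintaining accuracy, that fusion with SE improves performance, and that inverted insertion makes better use of global features in the early stages.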
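
The headline comparison can be checked with simple arithmetic on the figures quoted in this review. The sketch below uses only numbers stated above and in the abstract; the Vim-Ti accuracy is not given directly, so it is derived here from the stated 5.6% margin and should be read as an implied value rather than a reported one. Variable names are illustrative.

```python
# Figures below are copied from this review and the abstract; nothing is measured here.
reported = {
    "EfficientVMamba-S": {"gflops": 1.3, "top1": 78.7},
    "EfficientVMamba-B": {"gflops": 4.0, "top1": 81.8},
}
VIM_TI_GFLOPS = 1.5        # Vim-Ti cost, from the abstract
MARGIN_OVER_VIM_TI = 5.6   # accuracy margin of EfficientVMamba-S, from the abstract

small = reported["EfficientVMamba-S"]
# Implied (not reported here) Vim-Ti top-1 accuracy.
implied_vim_ti_top1 = round(small["top1"] - MARGIN_OVER_VIM_TI, 1)
# Relative FLOPs saving of EfficientVMamba-S versus Vim-Ti, in percent.
flops_saving_pct = round(100 * (1 - small["gflops"] / VIM_TI_GFLOPS), 1)

base = reported["EfficientVMamba-B"]
# What scaling from -S to -B buys in accuracy, and what it costs in FLOPs.
extra_top1 = round(base["top1"] - small["top1"], 1)
flops_ratio = round(base["gflops"] / small["gflops"], 2)

print("Implied Vim-Ti top-1 (%):", implied_vim_ti_top1)
print("FLOPs saving of -S vs Vim-Ti (%):", flops_saving_pct)
print("Top-1 gain of -B over -S (points):", extra_top1)
print("FLOPs ratio -B / -S:", flops_ratio)
```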

This review was generated by AI and checked by a human editor.