QUICK REVIEW

[Paper Review] PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

Chenhongyi Yang, Zehui Chen|arXiv (Cornell University)|Mar 26, 2024

Advanced Image and Video Retrieval Techniques|Cited by: 25

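One-sentence summary

PlainMamba is a simple non-hierarchical state space model for visual recognition that processes images efficiently using continuous 2D scanning and direction-aware updating, while keeping complexity low compared with hierarchical models.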

ABSTRACT

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks, achieving performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at: https://github.com/ChenhongyiYang/PlainMamba .

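Motivation and objectives

Motivate a simple, non-hierarchical, Mamba-inspired visual encoder for a broad range of vision tasks.
Adapt selective scanning to 2D image data so that spatial continuity is preserved.
Introduce continuous 2D scanning and direction-aware updating to encode spatial relations.
Provide scalable PlainMamba variants that keep a constant width and avoid a CLS token.
Demonstrate competitive performance on ImageNet classification, COCO detection, and ADE20K segmentation.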

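Proposed method

Revisit state space modeling (SSM) and the Mamba approach, which performs input-dependent state updates.
Introduce a convolutional tokenizer that produces visual tokens from the image.
Stack identical PlainMamba blocks, keeping the width constant and avoiding a CLS token.
Develop a continuous 2D scan that keeps consecutive tokens in the scanning sequence adjacent in 2D space.
Add direction-aware updating, which injects 2D relative position information into the selective scan.
Define three PlainMamba variants (L1, L2, L3) of increasing depth and width, and report their FLOPs and parameter counts.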
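The idea behind continuous 2D scanning and direction-aware updating can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes a serpentine (snake) traversal as one example of a continuous scan, and all function names (`snake_scan_order`, `row_major_order`, `is_continuous`, `step_directions`) are hypothetical. It shows that a plain row-major scan jumps across the image at the end of every row, while the snake order keeps consecutive tokens adjacent, and it labels each step with the direction taken, which is the kind of relative position information a direction-aware update could consume.

```python
from typing import List, Tuple

Position = Tuple[int, int]  # (row, col) of a token on the 2D token grid


def snake_scan_order(height: int, width: int) -> List[Position]:
    """Serpentine order: left-to-right on even rows, right-to-left on odd rows.

    Assumption: this is one possible continuous scan, used for illustration only.
    """
    order: List[Position] = []
    for row in range(height):
        cols = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
        order.extend((row, col) for col in cols)
    return order


def row_major_order(height: int, width: int) -> List[Position]:
    """Plain raster order, which jumps back to column 0 at every row end."""
    return [(row, col) for row in range(height) for col in range(width)]


def is_continuous(order: List[Position]) -> bool:
    """True if every pair of consecutive positions is 4-neighbour adjacent."""
    return all(
        abs(r1 - r0) + abs(c1 - c0) == 1
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


def step_directions(order: List[Position]) -> List[str]:
    """Label each token with the move that reached it ('begin' for the first).

    Requires a continuous order; a non-adjacent step raises KeyError.
    """
    names = {(0, 1): "right", (0, -1): "left", (1, 0): "down", (-1, 0): "up"}
    labels = ["begin"]
    for (r0, c0), (r1, c1) in zip(order, order[1:]):
        labels.append(names[(r1 - r0, c1 - c0)])
    return labels


if __name__ == "__main__":
    snake = snake_scan_order(2, 3)
    print(snake)
    print(is_continuous(snake), is_continuous(row_major_order(2, 3)))
    print(step_directions(snake))
```

How the direction labels enter the state update (for example as learned per-direction parameters) is a modeling choice that this sketch deliberately leaves out.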

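Experimental results

Research questions

RQ1: How does a non-hierarchical SSM-based encoder perform on standard vision tasks without a CLS token or a hierarchical multi-scale structure?
RQ2: Can continuous 2D scanning and direction-aware updating improve 2D spatial learning in SSM-based vision models?
RQ3: How do the PlainMamba variants compare with non-hierarchical SSMs, transformers, and hierarchical models on classification, detection, and segmentation?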

Key findings

Model	Hierarchical	Params	FLOPs	Top-1
PlainMamba-L1	No	7.3M	3.0G	77.9
PlainMamba-L2	No	25M	8.1G	81.6
PlainMamba-L3	No	50M	14.4G	82.3

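PlainMamba-L2 and PlainMamba-L3 achieve Top-1 accuracy on ImageNet-1K that is competitive with non-hierarchical SSMs and transformers, and close to that of hierarchical models of similar size.
PlainMamba outperforms previous non-hierarchical SSMs (e.g., Vision Mamba, Mamba-ND) at a similar parameter budget.
On semantic segmentation (ADE20K) and object detection (COCO), PlainMamba matches or exceeds non-hierarchical baselines, in some configurations with fewer parameters and lower FLOPs.
Ablation studies suggest that deeper models with balanced width and depth generally improve accuracy, with diminishing returns beyond a certain depth and width.
Compared with CLS-token-based or hierarchical approaches, PlainMamba offers a simpler, more scalable backbone that delivers competitive performance and eases cross-modal integration.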
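As a rough illustration of diminishing returns, the arithmetic below uses only the three ImageNet-1K rows from the table above. It is not the ablation study itself, and the helper `marginal_gains` is a hypothetical name introduced here. It computes how many Top-1 points each step up in model size buys per additional GFLOP.

```python
# Rows copied from the table in this review: (model, params in M, GFLOPs, Top-1 %).
MODELS = [
    ("PlainMamba-L1", 7.3, 3.0, 77.9),
    ("PlainMamba-L2", 25.0, 8.1, 81.6),
    ("PlainMamba-L3", 50.0, 14.4, 82.3),
]


def marginal_gains(rows):
    """For each consecutive pair of models, return
    (step label, Top-1 gain, extra GFLOPs, Top-1 gain per extra GFLOP)."""
    gains = []
    for (name_a, _, flops_a, acc_a), (name_b, _, flops_b, acc_b) in zip(rows, rows[1:]):
        d_acc = round(acc_b - acc_a, 1)        # Top-1 points gained
        d_flops = round(flops_b - flops_a, 1)  # additional GFLOPs spent
        gains.append((f"{name_a} -> {name_b}", d_acc, d_flops, round(d_acc / d_flops, 2)))
    return gains


if __name__ == "__main__":
    for step, d_acc, d_flops, per_gflop in marginal_gains(MODELS):
        print(f"{step}: +{d_acc} Top-1 for +{d_flops} GFLOPs ({per_gflop} points/GFLOP)")
```

On these numbers, going from L1 to L2 yields about 0.73 Top-1 points per extra GFLOP, while going from L2 to L3 yields about 0.11.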


This review was generated by AI and checked by a human editor.