QUICK REVIEW

[Paper Review] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Lianghui Zhu, Bencheng Liao|arXiv (Cornell University)|Jan 17, 2024

Domain Adaptation and Few-Shot Learning | Citations: 384

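One-line summary

Vision Mamba (Vim) is a pure SSM vision backbone built on bidirectional state space models; on high-resolution images it achieves competitive accuracy with lower compute and memory use than ViTs.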

ABSTRACT

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

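Motivation and objectives

Motivate a pure state space model backbone that replaces attention-based backbones for visual data.
Incorporate bidirectional state space modeling and position embeddings for visual data.
Demonstrate compute and memory efficiency on high-resolution images.
Demonstrate the effectiveness of Vim on ImageNet classification and downstream dense prediction tasks.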

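Proposed method

Adopts Mamba-based bidirectional SSM blocks to process sequences of image patches.
Introduces the Vim block, which applies forward and backward SSMs with learned projections and gating.
Adds position embeddings to the patch tokens and to the class token used for classification.
Uses SRAM/HBM memory-aware execution and recomputation to reduce memory and IO.
Provides an architecture with L Vim blocks, hidden dimension D, and expanded dimension E.
Compares Vim with ViT-based and SSM-based backbones on ImageNet, ADE20K, and COCO.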
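
To make the token preparation and the forward/backward scan concrete, here is a minimal NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: the SSM parameters are fixed rather than input-dependent (so there is no Mamba-style selection), the learned projections, Conv1d, and normalization are omitted, and nothing here reflects the hardware-aware kernel. All function and parameter names (`add_class_token_and_positions`, `ssm_scan`, `bidirectional_block`, `cls_index`, `w_gate`) are hypothetical.

```python
import numpy as np

def add_class_token_and_positions(patches, cls_token, pos_emb, cls_index):
    """Insert a class token at cls_index, then add position embeddings.

    patches: (T, D), cls_token: (D,), pos_emb: (T + 1, D).
    The token position is left as a parameter (an assumption of this sketch).
    """
    tokens = np.insert(patches, cls_index, cls_token, axis=0)  # (T + 1, D)
    return tokens + pos_emb

def ssm_scan(x, a, b, c):
    """Diagonal linear state space recurrence over a sequence.

    x: (T, D) tokens; a, b, c: (D, N) fixed per-channel parameters.
    h_t = a * h_{t-1} + b * x_t,   y_t = sum over N of (c * h_t)
    """
    T, D = x.shape
    h = np.zeros_like(a)                  # hidden state, shape (D, N)
    y = np.empty((T, D))
    for t in range(T):
        h = a * h + b * x[t][:, None]     # update state with token t
        y[t] = (c * h).sum(axis=1)        # read out one value per channel
    return y

def bidirectional_block(x, fwd, bwd, w_gate):
    """Toy bidirectional block: forward scan plus backward scan, gated, with a residual."""
    y_f = ssm_scan(x, *fwd)               # each output sees tokens 0..t
    y_b = ssm_scan(x[::-1], *bwd)[::-1]   # scan the reversed sequence, restore order
    z = x @ w_gate
    gate = z / (1.0 + np.exp(-z))         # SiLU gate
    return x + gate * (y_f + y_b)
```

The forward scan alone is causal: the output at position t depends only on tokens up to t. Adding the backward scan lets every position also see the tokens after it, which is how the sketch gives each patch access to the whole sequence without attention.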

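Experimental results

Research questions

RQ1: Can a pure SSM backbone match or exceed Transformer-based vision models on standard benchmarks?
RQ2: Does bidirectional SSM modeling provide sufficient global context and spatial awareness for dense prediction?
RQ3: How does Vim's efficiency (speed and memory) on high-resolution images compare with DeiT?
RQ4: How do design choices such as the classification token strategy and the bidirectional configuration affect performance on classification and segmentation tasks?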

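Key findings

Vim is 2.8 times faster than DeiT and saves 86.8% of GPU memory when extracting features from 1248×1248 images during batch inference.
Vim outperforms DeiT on ImageNet classification across multiple model scales.
A bidirectional SSM with a backward path and Conv1d enhancement improves segmentation and classification results over the unidirectional setting.
On COCO, Vim-Ti surpasses DeiT-Ti in both box AP and mask AP, indicating stronger long-range context learning.
Vim enables high-resolution sequential visual representation learning without 2D priors, and in some settings maintains competitive or better accuracy with fewer parameters.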
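
As a rough illustration of why resolution matters for the efficiency finding, the sketch below counts patch tokens per image and the number of pairwise entries a full self-attention matrix would hold for that many tokens. The patch size of 16 is an assumption (the review does not state it), and the helper names are hypothetical. This is back-of-the-envelope arithmetic only; it does not reproduce the 2.8 times speedup or the 86.8% memory figure.

```python
def num_patch_tokens(height, width, patch_size=16):
    """Number of non-overlapping patches in an image (patch_size is an assumption)."""
    return (height // patch_size) * (width // patch_size)

def attention_matrix_entries(num_tokens):
    """Entries in one full token-by-token attention matrix: quadratic in sequence length."""
    return num_tokens ** 2

for side in (224, 512, 1248):
    n = num_patch_tokens(side, side)
    print(f"{side}x{side}: {n} tokens, {attention_matrix_entries(n)} attention entries")
```

Under this assumption a 1248×1248 image yields 78 × 78 = 6084 tokens, about 31 times as many as the 196 tokens of a 224×224 image, while the attention matrix grows by roughly the square of that factor. A recurrent scan visits each token once, so its cost grows linearly with the token count.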

This review was generated by AI and checked by a human editor.