QUICK REVIEW

[Paper Review] MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz|arXiv (Cornell University)|Jul 10, 2024

Advanced Vision and Imaging | Citations: 27

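One-Line Summary

MambaVision combines a redesigned Mamba block with Transformer-style attention to build a hierarchical vision backbone that achieves a state-of-the-art trade-off between ImageNet-1K accuracy and image throughput, and it also shows strong results on downstream tasks.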

ABSTRACT

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://github.com/NVlabs/MambaVision

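Motivation and Objectives

Redesign the Mamba block to better suit vision tasks, improving accuracy and throughput.
Systematically study integration patterns of Mamba and Transformer blocks for vision.
Propose a hierarchical MambaVision backbone that combines CNN-based stages with Mamba and Transformer blocks.
Demonstrate state-of-the-art Pareto efficiency on ImageNet-1K and competitive performance on downstream tasks.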

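Proposed Method

Introduce a hierarchical four-stage backbone: CNN-based early stages extract high-resolution features, and the later stages (Stages 3 and 4) mix MambaVision mixer/MLP blocks with Transformer blocks.
Replace the causal convolution in the SSM branch with a regular convolution, add a symmetric branch without SSM, then concatenate and project the outputs.
Use a dual-branch MambaVision mixer that pairs an SSM branch (1D convolution followed by SSM) with a symmetric convolution-only branch; both branches are projected to half the embedding dimension before concatenation.
Run a hybrid-pattern study that evaluates where self-attention blocks are inserted, confirming that attention in the final layers is most effective.
Apply standard vision training recipes and downstream-task pipelines to evaluate classification on ImageNet-1K and detection/segmentation on MS COCO and ADE20K.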
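The difference between a causal and a regular convolution in the mixer can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the helper names `conv1d_same` and `branch_widths`, the kernel, and the zero padding are assumptions chosen for illustration. A causal convolution lets each position see only itself and earlier positions, which suits autoregressive sequences; a regular (symmetrically padded) convolution also sees later positions, which is presumably the better fit for image tokens that have no inherent temporal order.

```python
def conv1d_same(x, kernel, causal):
    """Slide `kernel` over `x` and return an output of the same length.

    causal=True  pads only on the left, so y[t] uses x[t-k+1 .. t].
    causal=False pads on both sides, so y[t] also uses later positions.
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = (k - 1) - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]


def branch_widths(embed_dim):
    """Widths of (SSM branch, symmetric branch, concatenated output).

    Assumes an even embedding dimension, so two half-width branches
    concatenate back to the full width.
    """
    half = embed_dim // 2
    return half, half, half + half


# A single impulse in the middle of a 5-token sequence.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
box = [1.0, 1.0, 1.0]  # illustrative kernel, not a learned one

causal_out = conv1d_same(impulse, box, causal=True)
regular_out = conv1d_same(impulse, box, causal=False)
```

With the causal variant the impulse influences only its own position and the ones after it; with the regular variant it spreads to both neighbours. `branch_widths` only records the shape bookkeeping described above: two half-width branches whose concatenation restores the embedding dimension.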

Figure 2 : The architecture of hierarchical MambaVision models. The first two stages use residual convolutional blocks for fast feature extraction. Stage 3 and 4 employ both MambaVision and Transformer blocks. Specifically, given $N$ layers, we use $\frac{N}{2}$ MambaVision and MLP blocks which are
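The layer ordering inside a hybrid stage can be sketched as follows. The caption above is cut off, so two points here are assumptions rather than statements from the source: that the remaining half of the layers are self-attention blocks (consistent with the abstract's "self-attention blocks in the final layers"), and that an odd layer count gives the extra layer to the mixer half. The function name `stage_layout` and the string labels are hypothetical.

```python
def stage_layout(num_layers):
    """Return the block types of one hybrid stage, first layer to last.

    Assumption: the first half are MambaVision mixer blocks and the
    second half are self-attention blocks, each followed by an MLP.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    n_attention = num_layers // 2
    n_mixer = num_layers - n_attention  # odd counts favour the mixer (assumption)
    return ["mixer+mlp"] * n_mixer + ["attention+mlp"] * n_attention


layout = stage_layout(8)
```

For `num_layers = 8` this gives four mixer blocks followed by four attention blocks, so attention is concentrated at the end of the stage.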

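Experimental Results

Research Questions

RQ1: How does integrating Vision Transformers with Mamba affect the performance and efficiency of a vision backbone?
RQ2: Which integration pattern (which layers/stages) yields the best accuracy-throughput trade-off in a hybrid Mamba-Transformer backbone?
RQ3: Can a hierarchical MambaVision backbone outperform existing Mamba and ViT backbones on ImageNet-1K and downstream vision tasks?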

Key Findings

Model	Image size	Params (M)	FLOPs (G)	Throughput (img/s)	Top-1 (%)
MambaVision-T	224	31.8	4.4	6298	82.3
MambaVision-T2	224	35.1	5.1	5990	82.7
MambaVision-S	224	50.1	7.5	4700	83.3
MambaVision-B	224	97.7	15.0	3670	84.2
MambaVision-L	224	227.9	34.9	2190	85.0
MambaVision-L2	224	241.5	37.5	1021	85.3

The MambaVision family reaches up to 85.3% Top-1 accuracy on ImageNet-1K (MambaVision-L2), with throughput ranging from 1021 to 6298 img/s across variants.
MambaVision-T: Top-1 82.3%, throughput 6298 img/s, 31.8M parameters.
MambaVision-S: Top-1 83.3%, throughput 4700 img/s, 50.1M parameters.
MambaVision-B: Top-1 84.2%, throughput 3670 img/s, 97.7M parameters.
MambaVision-L: Top-1 85.0%, throughput 2190 img/s, 227.9M parameters.
MambaVision-L2: Top-1 85.3%, throughput 1021 img/s, 241.5M parameters.
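The accuracy-throughput trade-off in the table can be checked with a small Pareto-front sketch. The numbers are copied from the table above; the function name `pareto_front` and the dominance rule (another model is at least as good on both axes and strictly better on one) are an illustrative choice, not the paper's evaluation code.

```python
# (model, throughput in img/s, Top-1 in %) taken from the table above.
ROWS = [
    ("MambaVision-T", 6298, 82.3),
    ("MambaVision-T2", 5990, 82.7),
    ("MambaVision-S", 4700, 83.3),
    ("MambaVision-B", 3670, 84.2),
    ("MambaVision-L", 2190, 85.0),
    ("MambaVision-L2", 1021, 85.3),
]


def pareto_front(rows):
    """Return names of models not dominated on (throughput, accuracy)."""
    front = []
    for name, thr, acc in rows:
        dominated = any(
            other_thr >= thr
            and other_acc >= acc
            and (other_thr > thr or other_acc > acc)
            for _, other_thr, other_acc in rows
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(ROWS)
```

Within the family every variant trades throughput for accuracy, so none dominates another and all six stay on the front. This says nothing about competing backbones, which are not listed in this review; comparing against them would require adding their rows from the paper.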

Figure 3 : Architecture of MambaVision block. In addition to replacing causal Conv layers with their regular counterparts, we create a symmetric path without SSM as a token mixer to enhance the modeling of global context.


This review was generated by AI and checked by a human editor.