QUICK REVIEW

[Paper Review] MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Ali Hatamizadeh, Jan Kautz | arXiv (Cornell University) | 2024. 07. 10.

Advanced Vision and Imaging | Citations: 27

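One-line summary

MambaVision combines a redesigned Mamba block with Transformer-style attention to build a hierarchical vision backbone that offers a state-of-the-art trade-off between accuracy and image throughput on ImageNet-1K and achieves strong results on downstream tasks.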

ABSTRACT

We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. Through a comprehensive ablation study, we demonstrate the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results show that equipping the Mamba architecture with self-attention blocks in the final layers greatly improves its capacity to capture long-range spatial dependencies. Based on these findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For classification on the ImageNet-1K dataset, MambaVision variants achieve state-of-the-art (SOTA) performance in terms of both Top-1 accuracy and throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones while demonstrating favorable performance. Code: https://github.com/NVlabs/MambaVision

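Motivation and goals

Redesign the Mamba block to suit vision tasks better, improving both accuracy and throughput.
Systematically study integration patterns of Mamba and Transformer blocks for vision.
Propose a hierarchical MambaVision backbone that combines CNN-based stages with Mamba and Transformer blocks.
Demonstrate a state-of-the-art accuracy-throughput Pareto front on ImageNet-1K and strong performance on downstream tasks.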

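Proposed method

Introduce a four-stage hierarchical backbone: CNN-based early stages extract features at high resolution, and the later stages mix MambaVision mixer/MLP blocks with Transformer blocks.
Replace the causal convolution in the SSM branch with a regular convolution, and add a symmetric branch without an SSM.
Use the resulting dual-branch MambaVision mixer: an SSM path (with a 1D convolution) and a symmetric convolution-only path are each projected to half the embedding dimension, and their outputs are concatenated and then projected.
Run a hybrid-pattern study that evaluates where to insert self-attention blocks, and find that placing attention in the final layers is the most effective.
Apply standard vision training recipes and downstream pipelines to evaluate classification on ImageNet-1K and detection/segmentation on MS COCO and ADE20K.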
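The dual-branch mixer described above can be written compactly. The following is a sketch in assumed notation rather than a quotation of the paper's equations: $X_{\mathrm{in}}$ is the input with embedding dimension $C$, $\mathrm{Scan}$ stands for the selective scan of the SSM, $\mathrm{Conv}$ for the regular (non-causal) 1D convolution, $\sigma$ for the activation, and the subscripts on $\mathrm{Linear}$ give assumed input and output widths.

$$
\begin{aligned}
X_1 &= \mathrm{Scan}\big(\sigma(\mathrm{Conv}(\mathrm{Linear}_{C \to C/2}(X_{\mathrm{in}})))\big) \\
X_2 &= \sigma\big(\mathrm{Conv}(\mathrm{Linear}_{C \to C/2}(X_{\mathrm{in}}))\big) \\
X_{\mathrm{out}} &= \mathrm{Linear}_{C \to C}\big(\mathrm{Concat}(X_1, X_2)\big)
\end{aligned}
$$

Under these assumptions each branch carries $C/2$ channels, so the concatenation restores the full width $C$ before the final projection. The only difference between the two branches is the scan: $X_2$ is a convolution-only path that keeps information the sequential SSM path might lose.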

Figure 2 : The architecture of hierarchical MambaVision models. The first two stages use residual convolutional blocks for fast feature extraction. Stage 3 and 4 employ both MambaVision and Transformer blocks. Specifically, given $N$ layers, we use $\frac{N}{2}$ MambaVision and MLP blocks which are
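The per-stage layout in the Figure 2 caption can be sketched in a few lines of Python. This is an illustration only: the function name `stage_layout` and the block labels are hypothetical, the assumption that the second half of a stage consists of self-attention blocks is taken from the abstract's statement that attention sits in the final layers, and giving the extra layer to the mixer half when $N$ is odd is an arbitrary choice made for this sketch.

```python
from typing import List

# Hypothetical labels for the two block types in a hybrid stage.
MIXER = "mambavision_mixer+mlp"
ATTENTION = "self_attention+mlp"


def stage_layout(num_layers: int) -> List[str]:
    """Return the block type of each layer in one hybrid stage.

    Assumption: the first half of the stage uses MambaVision mixer blocks
    and the final half uses self-attention blocks. For odd num_layers the
    extra layer goes to the mixer half (a choice made for this sketch).
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    num_attention = num_layers // 2
    num_mixer = num_layers - num_attention
    return [MIXER] * num_mixer + [ATTENTION] * num_attention


if __name__ == "__main__":
    # Print the layout of a stage with 8 layers (an illustrative depth).
    for index, block in enumerate(stage_layout(8)):
        print(index, block)
```

Keeping the layout in one small function makes it easy to try other patterns, such as interleaving or attention-first, which is the kind of comparison the hybrid-pattern study performs.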

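Experimental results

Research questions

RQ1: How does integrating Vision Transformers with Mamba affect the performance and efficiency of a vision backbone?
RQ2: Which integration pattern (that is, at which layers or stages attention is inserted) yields the best accuracy-throughput trade-off in a hybrid Mamba-Transformer backbone?
RQ3: Can the hierarchical MambaVision backbone outperform existing Mamba and ViT backbones on ImageNet-1K and on downstream vision tasks?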

Key results

Model	Image size	Params (M)	FLOPs (G)	Throughput (img/s)	Top-1 (%)
MambaVision-T	224	31.8	4.4	6298	82.3
MambaVision-T2	224	35.1	5.1	5990	82.7
MambaVision-S	224	50.1	7.5	4700	83.3
MambaVision-B	224	97.7	15.0	3670	84.2
MambaVision-L	224	227.9	34.9	2190	85.0
MambaVision-L2	224	241.5	37.5	1021	85.3

MambaVision variants reach up to 85.3% Top-1 accuracy on ImageNet-1K, with throughput ranging from 6298 img/s (MambaVision-T) down to 1021 img/s (MambaVision-L2).
MambaVision-T reaches 82.3% Top-1 at 6298 img/s with 31.8M parameters.
MambaVision-S reaches 83.3% Top-1 at 4700 img/s with 50.1M parameters.
MambaVision-B reaches 84.2% Top-1 at 3670 img/s with 97.7M parameters.
MambaVision-L reaches 85.0% Top-1 at 2190 img/s with 227.9M parameters.
MambaVision-L2 reaches 85.3% Top-1 at 1021 img/s with 241.5M parameters.
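To make the accuracy-throughput trade-off concrete, the sketch below checks which variants in the table are Pareto-optimal and how much accuracy each step down in throughput buys. It uses only the numbers reported above. The names `Variant`, `pareto_front`, and `tradeoff_steps` are hypothetical helpers for this illustration, and the comparison is within the MambaVision family only, not against the other backbones the paper compares with.

```python
from typing import List, NamedTuple, Tuple


class Variant(NamedTuple):
    name: str
    throughput: float  # images per second
    top1: float        # ImageNet-1K Top-1 accuracy in percent


# Values copied from the results table above.
VARIANTS = [
    Variant("MambaVision-T", 6298, 82.3),
    Variant("MambaVision-T2", 5990, 82.7),
    Variant("MambaVision-S", 4700, 83.3),
    Variant("MambaVision-B", 3670, 84.2),
    Variant("MambaVision-L", 2190, 85.0),
    Variant("MambaVision-L2", 1021, 85.3),
]


def pareto_front(variants: List[Variant]) -> List[Variant]:
    """Keep variants that no other variant beats on both axes at once."""
    front = []
    for v in variants:
        dominated = any(
            o.throughput >= v.throughput
            and o.top1 >= v.top1
            and (o.throughput > v.throughput or o.top1 > v.top1)
            for o in variants
        )
        if not dominated:
            front.append(v)
    return front


def tradeoff_steps(variants: List[Variant]) -> List[Tuple[str, str, float, float]]:
    """For neighbors ordered by falling throughput, return
    (faster model, slower model, Top-1 gain in points, throughput ratio)."""
    ordered = sorted(variants, key=lambda v: v.throughput, reverse=True)
    return [
        (a.name, b.name, round(b.top1 - a.top1, 1), round(b.throughput / a.throughput, 3))
        for a, b in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    print([v.name for v in pareto_front(VARIANTS)])
    for step in tradeoff_steps(VARIANTS):
        print(step)
```

Within this table every variant lies on the front, because accuracy rises monotonically as throughput falls. The steps are uneven, though: moving from MambaVision-L to MambaVision-L2 gains 0.3 points of Top-1 while throughput drops to roughly 47% of its previous value.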

Figure 3 : Architecture of MambaVision block. In addition to replacing causal Conv layer with their regular counterparts, we create a symmetric path without SSM as a token mixer to enhance the modeling of global context.


This review was generated by AI and reviewed by a human editor.