QUICK REVIEW

[Paper Review] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Lianghui Zhu, Bencheng Liao|arXiv (Cornell University)|2024. 01. 17.

Domain Adaptation and Few-Shot Learning | Citations: 384

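One-Line Summary

Vision Mamba (Vim) is a pure-SSM vision backbone built on bidirectional state space models. On high-resolution images it needs less computation and memory than ViT-style backbones such as DeiT while reaching competitive accuracy.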

ABSTRACT

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

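Motivation and Goals

Argues the need for a pure state space model backbone that can replace attention-based architectures for vision.
Introduces bidirectional state space modeling and position embeddings for visual data.
Demonstrates computation and memory efficiency on high-resolution images.
Demonstrates the effectiveness of Vim on ImageNet classification and downstream dense prediction tasks.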

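Proposed Method

Adopts Mamba-based bidirectional SSM blocks to process the sequence of image patches.
Introduces the Vim block, which applies forward and backward SSMs using learned projections and gating.
Adds position embeddings to the patch tokens and to the class token used for classification.
Uses SRAM/HBM-aware execution and recomputation to reduce memory and IO.
Specifies the architecture by the number of Vim blocks L, the hidden dimension D, and the expanded dimension E.
Compares Vim with ViT-based and SSM-based backbones on ImageNet, ADE20K, and COCO.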
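The forward and backward scan with gating described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration and not the authors' implementation: it uses a single channel with fixed scalar parameters (a, b, c) for the recurrence h_t = a*h_(t-1) + b*x_t, y_t = c*h_t, whereas the actual Vim block derives its SSM parameters from the input and also includes normalization, linear projections, and a Conv1d in each direction. The function names ssm_scan and bidirectional_ssm are hypothetical.

```python
import math


def ssm_scan(xs, a, b, c):
    """Run the linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Simplification (assumption): a, b, c are fixed scalars; in Mamba-style
    models the parameters are input-dependent and the state is a vector.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def silu(v):
    """SiLU activation, used here as the gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def bidirectional_ssm(xs, zs, fwd=(0.5, 1.0, 1.0), bwd=(0.5, 1.0, 1.0)):
    """Scan the sequence in both directions, gate each output, and sum.

    xs: per-token inputs, zs: per-token gate inputs (same length).
    fwd / bwd: hypothetical (a, b, c) parameters for each direction.
    """
    y_fwd = ssm_scan(xs, *fwd)
    # Backward path: scan the reversed sequence, then restore token order.
    y_bwd = ssm_scan(xs[::-1], *bwd)[::-1]
    return [(f + b) * silu(z) for f, b, z in zip(y_fwd, y_bwd, zs)]


if __name__ == "__main__":
    tokens = [0.0, 1.0, 0.0]   # an impulse at the middle token
    gates = [1.0, 1.0, 1.0]
    print("forward only :", ssm_scan(tokens, 0.5, 1.0, 1.0))
    print("bidirectional:", bidirectional_ssm(tokens, gates))
```

In the forward-only scan the first token's output is 0.0, because a causal recurrence cannot see later tokens. With the backward path added, the first token also receives a contribution from the impulse, which is the intuition behind using two directions for image patches that have no natural causal order.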

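Experimental Results

Research Questions

RQ1: Can a pure-SSM backbone match or surpass Transformer-based vision models on standard benchmarks?
RQ2: Does bidirectional SSM modeling provide enough global context and spatial awareness for dense prediction?
RQ3: How do Vim's speed and memory efficiency compare with DeiT on high-resolution images?
RQ4: How do design choices such as the class token strategy and the bidirectional configuration affect classification and segmentation performance?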

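Key Results

Vim is 2.8x faster than DeiT and saves 86.8% of GPU memory when extracting features from 1248x1248 images in batch inference.
Vim outperforms DeiT on ImageNet classification across several model scales.
A bidirectional SSM with a backward path and an added Conv1d gives better segmentation and classification performance than unidirectional configurations.
On COCO, Vim-Ti outperforms DeiT-Ti in both box AP and mask AP, which suggests stronger long-range context learning.
Vim enables sequential visual representation learning on high-resolution inputs without 2D priors, and in several settings it keeps competitive or better accuracy with fewer parameters.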
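To give a rough sense of why resolution matters so much for attention-based backbones, the sketch below counts patch tokens at two resolutions. It assumes 16x16 patches (an assumption, not stated in this review) and compares only how the number of token pairs grows against the number of tokens. It is not a FLOP or memory model and does not reproduce the 2.8x or 86.8% figures reported above. The helper name num_patch_tokens is hypothetical.

```python
def num_patch_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens; patch=16 is an assumption."""
    return (height // patch) * (width // patch)


n_low = num_patch_tokens(224, 224)      # 14 * 14 = 196 tokens
n_high = num_patch_tokens(1248, 1248)   # 78 * 78 = 6084 tokens

# Self-attention scores every token pair, so its cost grows with n**2.
# A sequential scan touches each token once per direction, so it grows with n.
pairwise_growth = (n_high ** 2) / (n_low ** 2)
linear_growth = n_high / n_low

if __name__ == "__main__":
    print("tokens at 224x224  :", n_low)
    print("tokens at 1248x1248:", n_high)
    print("growth of token pairs:", round(pairwise_growth, 1))
    print("growth of token count:", round(linear_growth, 1))
```

Under these assumptions the token count grows by a factor of about 31, while the number of token pairs grows by roughly the square of that, which is consistent with the direction of the efficiency gap reported for high-resolution inputs.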

This review was generated by AI and checked by a human editor.