QUICK REVIEW

[Paper Review] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

Jiaming Liu, Mengzhen Liu | arXiv (Cornell University) | 2024-06-06

AI-based Problem Solving and Planning | Citations: 7

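One-line summary

RoboMamba is an end-to-end robotic multimodal LLM that combines the Mamba state space model with a vision encoder to support visual reasoning and pose-prediction-based manipulation, with very lightweight fine-tuning and fast inference.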

ABSTRACT

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: https://sites.google.com/view/robomamba-web

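Research motivation and goals

Enable robots to understand visual scenes and carry out actions through an end-to-end multimodal LLM.
Use a selective state space model (SSM), Mamba, to achieve efficient inference with complexity linear in sequence length.
Integrate a vision encoder and align visual data with language embeddings to strengthen visual common sense and robot-related reasoning.
Develop a very lightweight fine-tuning strategy that enables end-effector pose prediction with minimal parameters and time.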
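
To make the linear-complexity goal concrete, here is a minimal sketch of a scalar, input-dependent state space recurrence. It is an illustration only, not the Mamba kernel or the authors' code: the function name `selective_scan`, the scalar weights `w_delta`, `a`, `w_b`, `w_c`, and the simplified discretization are all assumptions made for this example. The point is that each token updates a fixed-size state once, so the cost grows linearly with sequence length.

```python
import math

def selective_scan(xs, w_delta=1.0, a=-1.0, w_b=1.0, w_c=1.0):
    """Run a scalar, input-dependent state space recurrence in a single pass.

    All weights are hypothetical scalars; real models use learned matrices.
    """
    h = 0.0  # fixed-size hidden state carried across the sequence
    ys = []
    for x in xs:
        # Input-dependent step size; softplus keeps it positive.
        delta = math.log1p(math.exp(w_delta * x))
        # Discretized state decay, in (0, 1) whenever a < 0.
        a_bar = math.exp(delta * a)
        # Simplified discretized input projection that depends on x.
        b_bar = delta * (w_b * x)
        # One constant-cost state update per token.
        h = a_bar * h + b_bar * x
        # Input-dependent readout of the state.
        ys.append((w_c * x) * h)
    return ys

outputs = selective_scan([0.5, -1.0, 2.0])
```

The loop touches each input exactly once and never revisits earlier tokens, in contrast to attention, where each token is compared against all previous ones.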

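Proposed method

Integrates a CLIP-based vision encoder with the Mamba language model, using a cross-modal MLP connector to map visual features into Mamba's token space.
Trains in two steps, alignment pre-training (Stage 1.1) and instruction co-training (Stage 1.2), to instill visual common sense and robot-related reasoning.
Stage 1 as a whole is this two-step pipeline: alignment pre-training on image-text data, followed by instruction co-training on mixed vision-language datasets plus RoboVQA data.
Stage 2 introduces efficient manipulation fine-tuning: with the main model frozen, a simple policy head predicts the 6-DoF end-effector pose (2D position and 3D direction, or 7-DoF when the gripper is included).
The policy head consists of two MLPs, one for a_pos and one for a_dir, totaling about 3.7M parameters (0.1% of the model), and fine-tuning takes about 20 minutes.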
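
The Stage 2 setup described above can be sketched as two small MLPs that read features from a frozen backbone. This is a minimal standard-library sketch, not the authors' implementation: the class name `PolicyHead`, the helper functions, the layer sizes, and the output dimensions (2 for position, 3 for direction, following the description above) are all hypothetical placeholders. The backbone is not modeled; `features` stands in for its output.

```python
import random

def make_linear(n_in, n_out, rng):
    """Create (weights, biases) for one dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def make_mlp(sizes, rng):
    return [make_linear(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def mlp(layers, x):
    # ReLU after every layer except the last one.
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in linear(layer, x)]
    return linear(layers[-1], x)

def count_params(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

class PolicyHead:
    """Two MLPs: one predicts position (a_pos), the other direction (a_dir)."""

    def __init__(self, feat_dim, hidden_dim, pos_dim, dir_dim, seed=0):
        rng = random.Random(seed)
        self.pos_mlp = make_mlp([feat_dim, hidden_dim, pos_dim], rng)
        self.dir_mlp = make_mlp([feat_dim, hidden_dim, dir_dim], rng)

    def __call__(self, features):
        return mlp(self.pos_mlp, features), mlp(self.dir_mlp, features)

    def num_params(self):
        # Only these parameters would be trained; the backbone stays frozen.
        return count_params(self.pos_mlp) + count_params(self.dir_mlp)

# Hypothetical toy sizes; the real head is reported to have about 3.7M parameters.
head = PolicyHead(feat_dim=16, hidden_dim=8, pos_dim=2, dir_dim=3)
features = [0.1] * 16  # stand-in for the frozen model's output features
a_pos, a_dir = head(features)
```

Keeping the head separate from the backbone means gradients and optimizer state are needed only for the head's parameters, which is consistent with the short fine-tuning time the review reports.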

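Experimental results

Research questions

RQ1: Can an end-to-end, robot-focused MLLM achieve strong reasoning while keeping inference and fine-tuning efficient?
RQ2: Does integrating Mamba with a vision encoder provide robust visual common sense and robot-related reasoning for manipulation tasks?
RQ3: Is a lightweight, policy-head-based fine-tuning scheme sufficient to obtain reliable end-effector pose prediction without degrading the LLM's reasoning ability?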

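Key results

With a 2.7B-parameter model, RoboMamba achieves competitive general vision-language reasoning on multiple benchmarks (OKVQA, VQAv2, GQA, VizWiz, OCR-VQA, POPE, MME, MMB, MM-Vet).
On RoboVQA, its robot-related reasoning yields higher BLEU scores than the baselines, and inference is reported as roughly 7 times faster than existing robot MLLMs (the abstract above states 3 times faster than existing VLA models).
In SAPIEN simulation, RoboMamba achieves state-of-the-art manipulation performance using a 7MB policy head and under 20 minutes of fine-tuning on an A100 GPU.
Pose-prediction fine-tuning needs only 0.1% of the model parameters (3.7M) and about 20 minutes, which suggests that reasoning ability enables efficient acquisition of manipulation skills.
In real-world experiments, RoboMamba performs long-horizon planning tasks, predicts end-effector poses, and shows strong reasoning and affordance reasoning.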
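
The parameter and size figures quoted above can be cross-checked with a few lines of arithmetic. This is a rough sanity check under stated assumptions: it takes 2.7B as the full model size (the vision encoder and connector would add to it and lower the fraction slightly) and assumes 16-bit storage for the size estimate. Neither assumption is confirmed by the review.

```python
policy_head_params = 3.7e6   # from the review: about 3.7M parameters
backbone_params = 2.7e9      # from the review: 2.7B-parameter model

# Share of parameters that are trained during manipulation fine-tuning.
fraction = policy_head_params / backbone_params
print(f"trainable fraction: {fraction:.4%}")

bytes_per_param = 2          # assumption: 16-bit (2-byte) parameter storage
size_mb = policy_head_params * bytes_per_param / 1e6
print(f"policy head size: {size_mb:.1f} MB")
```

Under these assumptions the fraction comes out near 0.14%, in line with the rounded 0.1% figure, and the size comes out near 7.4 MB, in line with the 7MB policy head.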


This review was generated by AI and reviewed by a human editor.