QUICK REVIEW

[Paper Review] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

Jiaming Liu, Mengzhen Liu|arXiv (Cornell University)|Jun 6, 2024

AI-based Problem Solving and Planning | Cited by 7

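One-Sentence Summary

RoboMamba is an end-to-end multimodal large language model for robots that combines the Mamba state space model with a vision encoder to deliver visual reasoning and pose-prediction-based manipulation, with highly efficient fine-tuning and fast inference.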

ABSTRACT

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models. Our project web page: https://sites.google.com/view/robomamba-web

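Research Motivation and Goals

Enable robots to understand visual scenes and execute actions through a single end-to-end multimodal large model.
Use Mamba, a selective state space model (SSM), for efficient inference with linear complexity.
Integrate a vision encoder and align visual tokens with language embeddings to strengthen visual common sense and robot-related reasoning.
Develop an ultra-lightweight fine-tuning strategy that learns end-effector pose prediction with minimal parameters and time.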
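
The linear-complexity goal can be illustrated with a minimal sketch of a state space recurrence. This is a simplification with fixed scalar parameters `a`, `b`, and `c` (hypothetical values chosen for illustration); Mamba additionally makes its parameters depend on the input, which this sketch does not model. The point is only that each token updates a fixed-size state once, so the cost grows linearly with sequence length.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    Assumption: a, b, c are fixed scalars; a real SSM layer uses
    vector-valued states and learned parameters.
    """
    h = 0.0  # fixed-size state carried across the whole sequence
    ys = []
    for x in xs:  # one constant-cost update per token: O(len(xs)) overall
        h = a * h + b * x
        ys.append(c * h)
    return ys


# An impulse input decays geometrically through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

Because the state `h` has a fixed size, memory use during inference does not grow with the number of tokens already processed.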

Proposed Method

A CLIP-based vision encoder is connected to the Mamba language model through a cross-modal MLP connector, which maps visual features into Mamba's token embedding space.
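Training has two stages. Stage 1 consists of alignment pre-training on image-text data (Stage 1.1) followed by instruction co-training on a mix of vision-language datasets and RoboVQA data (Stage 1.2), which instills visual common sense and robot-related reasoning.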
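Stage 2 adds efficient manipulation fine-tuning: a simple policy head predicts the 6-DoF end-effector pose (3D position plus 3D direction, or 7-DoF when the gripper state is included) while the main model stays frozen.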
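The policy head consists of two MLPs, one for the position a_pos and one for the direction a_dir, totaling about 3.7M parameters (0.1% of the model); fine-tuning takes about 20 minutes.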
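
The data flow described above can be sketched structurally as follows. This is a toy illustration, not the authors' implementation: the layer widths, the class and function names, the order in which visual and text tokens are concatenated, and the mean-pooling stand-in for the frozen Mamba backbone are all assumptions made so the example runs in pure Python.

```python
import random
from typing import List, Tuple

Vector = List[float]


class MLP:
    """Small fully connected network with ReLU between layers (pure Python)."""

    def __init__(self, sizes: List[int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.layers = []
        for n_in, n_out in zip(sizes, sizes[1:]):
            weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
            self.layers.append((weights, [0.0] * n_out))

    def num_params(self) -> int:
        """Count weights plus biases across all layers."""
        return sum(len(w) * len(w[0]) + len(b) for w, b in self.layers)

    def __call__(self, x: Vector) -> Vector:
        last = len(self.layers) - 1
        for k, (weights, bias) in enumerate(self.layers):
            x = [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
            if k < last:
                x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
        return x


def frozen_backbone(tokens: List[Vector]) -> Vector:
    """Stand-in for the frozen language model: mean-pools the token sequence.

    Assumption for illustration only; the real backbone is Mamba.
    """
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def predict_pose(visual_tokens: List[Vector], text_tokens: List[Vector],
                 connector: MLP, pos_head: MLP, dir_head: MLP) -> Tuple[Vector, Vector]:
    """Project visual tokens, run the frozen backbone, then apply both heads."""
    projected = [connector(tok) for tok in visual_tokens]  # into the LM embedding width
    feature = frozen_backbone(projected + text_tokens)     # assumed order: visual, then text
    return pos_head(feature), dir_head(feature)            # (a_pos, a_dir)


VIS_DIM, LM_DIM = 6, 8  # hypothetical toy sizes, far smaller than any real model
connector = MLP([VIS_DIM, LM_DIM, LM_DIM], seed=1)
pos_head = MLP([LM_DIM, 4, 3], seed=2)  # 3D position
dir_head = MLP([LM_DIM, 4, 3], seed=3)  # 3D direction

visual = [[0.5] * VIS_DIM for _ in range(4)]
text = [[0.1] * LM_DIM for _ in range(2)]
a_pos, a_dir = predict_pose(visual, text, connector, pos_head, dir_head)

# In Stage 2 only the two heads would be trained; everything else stays fixed.
trainable = pos_head.num_params() + dir_head.num_params()
```

Keeping the heads separate from the backbone is what makes the fine-tuned artifact small: only `pos_head` and `dir_head` would need to be stored per task in this sketch.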

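Experimental Results

Research Questions

RQ1: Can an end-to-end, robot-oriented large language model achieve strong reasoning while keeping inference and fine-tuning efficient?
RQ2: Does integrating a vision encoder with Mamba yield robust visual common sense and robot-related reasoning for manipulation tasks?
RQ3: Is fine-tuning a lightweight policy head sufficient for reliable end-effector pose prediction without weakening the model's reasoning ability?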

Key Findings

With a 2.7B-parameter model, RoboMamba achieves competitive general vision-language reasoning across multiple benchmarks (OKVQA, VQAv2, GQA, VizWiz, OCR-VQA, POPE, MME, MM-B, MM-Vet).
On RoboVQA, RoboMamba achieves higher BLEU scores than the baselines for robot-related reasoning, with inference about 3 times faster than existing VLA models.
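In SAPIEN simulation, RoboMamba reaches state-of-the-art manipulation performance using a 7 MB policy head fine-tuned in under 20 minutes on an A100 GPU.
Pose-prediction fine-tuning updates only 0.1% of the model's parameters (3.7M) and takes about 20 minutes, indicating that reasoning capability makes manipulation skills more efficient to acquire.
Real-world experiments show that RoboMamba can plan long-horizon tasks and predict end-effector poses, with strong reasoning and affordance reasoning.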
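
The reported head size figures can be cross-checked with simple arithmetic. The sketch below assumes each parameter is stored in 2 bytes (16-bit precision) and uses the 2.7B parameter count as the denominator; neither assumption is stated in the summary, so treat the result as a plausibility check rather than a reproduction of the authors' accounting.

```python
def head_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Storage size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


def trainable_fraction(head_params: int, total_params: int) -> float:
    """Share of parameters that are updated during fine-tuning."""
    return head_params / total_params


HEAD_PARAMS = 3_700_000      # policy head size reported in the findings
MODEL_PARAMS = 2_700_000_000 # 2.7B model size reported in the findings
BYTES_PER_PARAM = 2          # assumption: 16-bit storage

size_mb = head_size_mb(HEAD_PARAMS, BYTES_PER_PARAM)
fraction = trainable_fraction(HEAD_PARAMS, MODEL_PARAMS)
print(f"{size_mb:.1f} MB, {fraction:.2%} of the model")
```

Under these assumptions the head comes to about 7.4 MB and roughly 0.14% of the parameters, which agrees with the rounded 7 MB and 0.1% figures above.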

This review was generated by AI and reviewed by a human editor.