QUICK REVIEW

[Paper Review] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Dongyoung Kim, Sumin Park|arXiv (Cornell University)|Mar 22, 2026
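Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

RoboAlign applies supervised fine-tuning (SFT) followed by reinforcement learning (RL) to align zero-shot embodied reasoning with low-level FAST action tokens. It delivers substantial VLA gains on LIBERO, CALVIN, and a real robot, with the RL stage using less than 1% of the data.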

ABSTRACT

Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of the vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework, RoboAlign, that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.

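Motivation and Objectives

  • Unlock robust embodied reasoning in VLAs by bridging the modality gap between language and action.
  • Propose RoboAlign, which generates low-level action tokens through zero-shot reasoning and refines that reasoning with RL.
  • Show that RL-aligned models outperform SFT-only and other alignment methods on robotics benchmarks.
  • Demonstrate transfer across different MLLM backbones and real-world robot tasks.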

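Proposed Method

  • Apply SFT to the MLLM backbone so that it can generate FAST tokens for low-level actions.
  • Attach a diffusion-based action head and train on a mixture of datasets, including RoboAlign VQA and reasoning data.
  • Apply GRPO in an RL loop, with action accuracy as the reward, to improve the accuracy of the low-level action tokens.
  • In Stage 2, add <think>...</think> to the prompt to elicit explicit reasoning, and maximize the format and accuracy rewards.
  • Evaluate on LIBERO, CALVIN, and real-robot scenarios, comparing against language-based RL, visual-trajectory RL, and SFT baselines.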
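
The RL stage described in the list above combines a format reward, an accuracy reward, and GRPO. A minimal Python sketch of how these pieces could fit together follows. It is an illustration under stated assumptions, not the authors' implementation: the function names, the representation of action tokens as whitespace-separated integers, the position-wise match metric, and the unweighted sum of the two rewards are all hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import re
from statistics import mean, pstdev

# Assumed output format: "<think>reasoning</think>" followed by the action tokens.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*(?P<answer>.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion has a non-empty <think> block plus an answer."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, target_tokens: list[int]) -> float:
    """Hypothetical accuracy metric: fraction of position-wise token matches."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None or not target_tokens:
        return 0.0
    # Assumption: action tokens are printed as integers after the reasoning.
    predicted = [int(tok) for tok in re.findall(r"-?\d+", match.group("answer"))]
    hits = sum(p == t for p, t in zip(predicted, target_tokens))
    # Normalize by the longer sequence so extra or missing tokens are penalized.
    return hits / max(len(predicted), len(target_tokens))


def total_reward(completion: str, target_tokens: list[int]) -> float:
    """Unweighted sum of format and accuracy rewards (the weighting is an assumption)."""
    return format_reward(completion) + accuracy_reward(completion, target_tokens)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = [12, 7, 33]  # hypothetical ground-truth action tokens
    group = [
        "<think>The gripper should move left.</think> 12 7 33",
        "<think>Move right.</think> 12 7 40",
        "12 7 33",  # correct tokens but no reasoning block
    ]
    rewards = [total_reward(c, target) for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a completion that skips the <think> block earns no reward at all, even with correct tokens, so its advantage within the group is negative. Whether the actual method gates accuracy on format in this way is not stated in this summary.
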
Figure 1 : Performance on LIBERO. VLAs built upon MLLMs specialized for embodied reasoning (fine-tuned variants of Qwen2.5-VL-7B-Instruct) fail to significantly improve performance and often degrade it compared to the baseline VLA based on the original model. In contrast, RoboAlign achieves signific

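Experimental Results

Research Questions

  • RQ1: Does RoboAlign consistently improve VLA performance on simulated and real-robot benchmarks?
  • RQ2: Is RL-based low-level action alignment more effective than high-level language or 2D-trajectory alignment?
  • RQ3: Does RoboAlign preserve or improve a general-purpose MLLM's embodied reasoning and real-world generalization?
  • RQ4: How well does RoboAlign generalize across MLLM backbones (e.g., Qwen2.5VL-7B-Ins, Qwen3VL-8B-Ins)?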

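Key Findings

  • RoboAlign yields substantial VLA gains over SFT baselines: 17.5% on LIBERO, 18.9% on CALVIN, and 106.6% in real-world environments, with RL using less than 1% of the data.
  • RL alignment on low-level actions outperforms both high-level language RL and 2D-trajectory RL on LIBERO's long-horizon tasks.
  • RL alignment improves real-robot performance and generalizes well across different MLLM backbones.
  • RoboAlign improves the quality of embodied-reasoning representations, as reflected in higher KNN accuracy (69.79% vs. 39.06%).
  • SFT-based alignment (ECoT) can degrade performance, whereas RoboAlign's RL-based approach preserves or improves general MLLM capabilities.
  • RoboAlign reaches state-of-the-art results on embodied reasoning benchmarks while retaining general MLLM capabilities.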
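
The KNN accuracy cited in the findings above is a common way to probe representation quality: if nearby feature vectors share a label, the representation separates the classes well. The sketch below shows one way such a probe could be computed. The leave-one-out protocol, cosine similarity, majority vote, and the toy features and labels are all assumptions made for illustration. This summary does not describe the paper's actual features, labels, or choice of k.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def knn_accuracy(features: list[list[float]], labels: list[str], k: int = 1) -> float:
    """Leave-one-out k-NN accuracy: classify each sample from its k nearest others."""
    correct = 0
    for i, query in enumerate(features):
        # Score every other sample against the query, most similar first.
        scored = sorted(
            (
                (cosine_similarity(query, other), labels[j])
                for j, other in enumerate(features)
                if j != i
            ),
            key=lambda pair: pair[0],
            reverse=True,
        )
        votes = Counter(label for _, label in scored[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(features)


if __name__ == "__main__":
    # Hypothetical 2-D features standing in for MLLM hidden states.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    labs = ["reach", "reach", "grasp", "grasp"]
    print(knn_accuracy(feats, labs, k=1))
```

A higher score under such a probe suggests that states with the same label cluster together in feature space, which is the kind of evidence the 69.79% vs. 39.06% comparison appears to provide.
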
Figure 2 : Overview of RoboAlign framework. RoboAlign directly aligns MLLM representations with low-level action generation using reasoning-incentivized reinforcement learning ( guo2025deepseek ) . The framework consists of two stages: (i) Stage 1 integrates embodied reasoning, zero-shot reasoning,

This review was generated by AI and reviewed by human editors.