QUICK REVIEW

[Paper Digest] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Dongyoung Kim, Sumin Park|arXiv (Cornell University)|Mar 22, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

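RoboAlign applies supervised fine-tuning (SFT) followed by reinforcement learning (RL) to align zero-shot embodied reasoning with low-level FAST action tokens, yielding significant VLA gains on LIBERO, CALVIN, and a real robot, with the RL stage using less than 1% of the data.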

ABSTRACT

Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of the visual question answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework, RoboAlign, that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
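
The abstract reports gains of 17.5%, 18.9%, and 106.6% over SFT baselines. A figure above 100% only makes sense as a relative improvement, so the formula below assumes that reading for all three numbers. The symbols s_SFT and s_RoboAlign (benchmark scores of the baseline and the RoboAlign-based VLA) are notation introduced here for illustration, not taken from the paper.

```latex
\[
\text{relative gain}
  = \frac{s_{\text{RoboAlign}} - s_{\text{SFT}}}{s_{\text{SFT}}} \times 100\%
\]
% Under this reading, a 106.6% gain corresponds to
% s_RoboAlign = (1 + 1.066) * s_SFT = 2.066 * s_SFT,
% i.e. slightly more than double the baseline score.
```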

Motivation and Objectives

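Unlock robust embodied reasoning in VLAs by bridging the modality gap between language and action.
Propose RoboAlign, which generates low-level action tokens through zero-shot reasoning and refines that reasoning with RL.
Show that RL-aligned models outperform SFT-only and other alignment methods on robotics benchmarks.
Demonstrate transfer across different MLLM backbones and real-world robot tasks.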

Proposed Method

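Apply SFT to the MLLM backbone so that it can generate FAST tokens for low-level actions.
Attach a diffusion-based action head and train on a mixture of datasets, including RoboAlign VQA and reasoning data.
Apply GRPO in an RL loop, with action accuracy as the reward, to improve the accuracy of the low-level action tokens.
In Stage 2, prompt for explicit reasoning inside <think>...</think> and maximize both the format reward and the accuracy reward.
Evaluate on LIBERO, CALVIN, and real-robot scenarios, comparing against language-based RL, visual-trajectory RL, and SFT baselines.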
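
The format and accuracy rewards and the GRPO step listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the response layout (a <think>...</think> block followed by space-separated integer token ids), the position-wise match metric, the unweighted sum of the two rewards, and every function name are assumptions made for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import re
import numpy as np

# Assumed response layout: "<think>reasoning</think> 12 7 301 ..."
RESPONSE_RE = re.compile(r"^\s*<think>(.+?)</think>\s*(.*?)\s*$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if a non-empty <think> block is followed by a non-empty answer."""
    m = RESPONSE_RE.match(response)
    return 1.0 if m and m.group(2) else 0.0


def parse_action_tokens(response: str) -> list[int]:
    """Extract integer action-token ids after the reasoning block."""
    m = RESPONSE_RE.match(response)
    if not m:
        return []
    return [int(t) for t in m.group(2).split() if t.isdigit()]


def accuracy_reward(pred: list[int], target: list[int]) -> float:
    """Assumed metric: fraction of positions where predicted and target ids agree."""
    if not pred or not target:
        return 0.0
    matches = sum(p == t for p, t in zip(pred, target))
    return matches / max(len(pred), len(target))


def total_reward(response: str, target: list[int]) -> float:
    """Unweighted sum of format and accuracy rewards (the weighting is an assumption)."""
    return format_reward(response) + accuracy_reward(parse_action_tokens(response), target)


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical group of responses sampled for one prompt.
target = [12, 7, 301, 45]
responses = [
    "<think>Move the gripper toward the mug.</think> 12 7 301 45",
    "<think>Reach left.</think> 12 7 99 45",
    "12 7 301 45",  # no reasoning block, so both rewards are zero here
]
rewards = [total_reward(r, target) for r in responses]
advantages = grpo_advantages(rewards)
print(rewards)     # [2.0, 1.75, 0.0]
print(advantages)  # highest for the fully correct, well-formatted response
```

In a full training loop these advantages would weight the policy-gradient update for each sampled response; that part is omitted here because the summary does not describe it.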

Figure 1 : Performance on LIBERO. VLAs built upon MLLMs specialized for embodied reasoning (fine-tuned variants of Qwen2.5-VL-7B-Instruct) fail to significantly improve performance and often degrade it compared to the baseline VLA based on the original model. In contrast, RoboAlign achieves signific

Experimental Results

Research Questions

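RQ1: Does RoboAlign consistently improve VLA performance on simulated and real-robot benchmarks?
RQ2: Is RL-based alignment on low-level actions more effective than alignment on high-level language or 2D trajectories?
RQ3: Does RoboAlign preserve or improve the embodied reasoning and real-world generalization of the general-purpose MLLM?
RQ4: How well does RoboAlign generalize across different MLLM backbones (e.g., Qwen2.5VL-7B-Ins, Qwen3VL-8B-Ins)?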

Key Findings

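RoboAlign yields significant VLA gains over SFT baselines: 17.5% on LIBERO, 18.9% on CALVIN, and 106.6% in real-world environments, with the RL stage using less than 1% of the data.
On LIBERO long-horizon tasks, RL alignment on low-level actions outperforms both high-level language RL and 2D-trajectory RL.
RL alignment improves real-robot performance and generalizes well across MLLM backbones.
RoboAlign strengthens the embodied reasoning representations, as reflected in higher KNN accuracy (69.79% vs. 39.06%).
SFT-based alignment (ECoT) can degrade performance, whereas RoboAlign's RL-based approach preserves or improves general MLLM capabilities.
RoboAlign reaches state-of-the-art results on embodied reasoning benchmarks while retaining general MLLM capabilities.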
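
The KNN accuracy cited in the findings is a probe on the model's representations: the better the features separate the labels, the more accurate a nearest-neighbor classifier on them becomes. The sketch below shows how such a probe is commonly computed. The summary does not say which features, labels, distance, or k the paper used, so the Euclidean distance, k=5, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def knn_accuracy(train_x, train_y, test_x, test_y, k: int = 5) -> float:
    """Majority-vote k-NN accuracy using Euclidean distance."""
    train_x = np.asarray(train_x, dtype=np.float64)
    test_x = np.asarray(test_x, dtype=np.float64)
    train_y = np.asarray(train_y, dtype=np.int64)
    test_y = np.asarray(test_y, dtype=np.int64)
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k nearest training points for each test point.
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    nn_labels = train_y[nn_idx]
    # Majority vote over the neighbors' labels.
    preds = np.array([np.bincount(row).argmax() for row in nn_labels])
    return float((preds == test_y).mean())


# Synthetic stand-ins for frozen-backbone features (hypothetical data).
rng = np.random.default_rng(0)
train_y = np.arange(60) % 3
test_y = np.arange(30) % 3
centers = 10.0 * np.eye(3, 8)  # three well-separated class centers in 8-D

# Features that encode the label: tight clusters around each class center.
clustered_train = centers[train_y] + rng.normal(scale=0.1, size=(60, 8))
clustered_test = centers[test_y] + rng.normal(scale=0.1, size=(30, 8))

# Features that carry no label information at all.
random_train = rng.normal(size=(60, 8))
random_test = rng.normal(size=(30, 8))

acc_clustered = knn_accuracy(clustered_train, train_y, clustered_test, test_y)
acc_random = knn_accuracy(random_train, train_y, random_test, test_y)
print(acc_clustered, acc_random)
```

The two synthetic feature sets mark the extremes: features that cluster by label give perfect probe accuracy, while uninformative features stay near chance. The reported gap (69.79% vs. 39.06%) lies between those extremes.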

Figure 2: Overview of RoboAlign framework. RoboAlign directly aligns MLLM representations with low-level action generation using reasoning-incentivized reinforcement learning (guo2025deepseek). The framework consists of two stages: (i) Stage 1 integrates embodied reasoning, zero-shot reasoning,


This digest was generated by AI and reviewed by a human editor.