QUICK REVIEW

[Paper Review] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Dongyoung Kim, Sumin Park|arXiv (Cornell University)|Mar 22, 2026

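Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

RoboAlign applies reinforcement learning after supervised fine-tuning to align zero-shot embodied reasoning with low-level FAST action tokens, achieving notable VLA gains on LIBERO, CALVIN, and real-world robots with less than 1% additional data.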

ABSTRACT

Improving embodied reasoning in multimodal-large-language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework, RoboAlign, that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.

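Motivation and Objectives

Motivates unlocking robust embodied reasoning in VLAs by bridging the language-action modality gap.
Proposes RoboAlign, which generates low-level action tokens through zero-shot reasoning and refines them with RL.
Shows that the RL-aligned model outperforms SFT-only and other alignment methods on robotics benchmarks.
Demonstrates transfer across MLLM backbones and to real-world robotics tasks.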

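Proposed Method

Use SFT on top of an MLLM backbone to enable FAST token generation.
Add a diffusion-based action head and train on a dataset mixture that includes RoboAlign VQA and reasoning data.
Apply GRPO with an action-accuracy reward to optimize action accuracy in the RL loop.
In Stage 2, add <think>...</think> to the prompt to elicit explicit reasoning, and maximize the format and accuracy rewards.
Evaluate across LIBERO, CALVIN, and real-robot settings, comparing against language-based RL, visual-trajectory RL, and SFT baselines.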
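
The reward design in the method list above (a format reward for <think>...</think> reasoning, an action-accuracy reward, and GRPO) can be illustrated with a minimal sketch. This review does not give the paper's exact reward definitions, so everything here is an assumption made for illustration: the response layout, the prefix-match accuracy metric, the unweighted sum of the two rewards, and the function names are all hypothetical. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
import re
from statistics import mean, pstdev

# Assumed response layout: reasoning inside <think>...</think>, then action tokens as integers.
_FORMAT = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response is '<think>reasoning</think> answer', else 0.0."""
    return 1.0 if _FORMAT.match(response.strip()) else 0.0


def accuracy_reward(response: str, target_tokens: list[int]) -> float:
    """Fraction of the target matched by the predicted prefix (assumed metric)."""
    m = _FORMAT.match(response.strip())
    if not m or not target_tokens:
        return 0.0
    predicted = [int(t) for t in re.findall(r"\d+", m.group(2))]
    matched = 0
    for p, t in zip(predicted, target_tokens):
        if p != t:
            break  # stop at the first mismatching action token
        matched += 1
    return matched / len(target_tokens)


def total_reward(response: str, target_tokens: list[int]) -> float:
    """Unweighted sum of format and accuracy rewards (weights are an assumption)."""
    return format_reward(response) + accuracy_reward(response, target_tokens)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group of sampled responses for one prompt (toy data).
target = [12, 7, 3]
group = [
    "<think>move left, then lower the gripper</think> 12 7 3",
    "<think>move right</think> 12 9 3",
    "12 7 3",  # correct tokens but no reasoning block, so it earns no reward here
]
rewards = [total_reward(r, target) for r in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first response gets the largest advantage and the unformatted third response the smallest, which is the signal the policy update would use to favor well-formed reasoning that leads to accurate action tokens.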

Figure 1: Performance on LIBERO. VLAs built upon MLLMs specialized for embodied reasoning (fine-tuned variants of Qwen2.5-VL-7B-Instruct) fail to significantly improve performance and often degrade it compared to the baseline VLA based on the original model. In contrast, RoboAlign achieves signific

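Experimental Results

Research Questions

RQ1: Does RoboAlign consistently improve VLA performance across simulation and real-robot benchmarks?
RQ2: Is RL-based alignment with low-level actions more effective than alignment with higher-level language or 2D trajectories?
RQ3: Does RoboAlign maintain or strengthen general MLLM embodied reasoning and real-world generalization?
RQ4: How does RoboAlign generalize across different MLLM backbones (e.g., Qwen2.5VL-7B-Ins, Qwen3VL-8B-Ins)?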

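Key Findings

RoboAlign yields notable VLA gains over SFT baselines: 17.5% on LIBERO, 18.9% on CALVIN, and 106.6% in real-world settings, with the RL stage using less than 1% of the data.
RL-based alignment with low-level actions outperforms high-level language RL and 2D trajectory RL on LIBERO long-horizon tasks.
RL alignment improves real-robot performance and generalizes across different MLLM backbones.
RoboAlign strengthens embodied reasoning representations, yielding higher KNN accuracy (69.79% vs. 39.06%).
SFT-based alignment (ECoT) can degrade performance, whereas RoboAlign's RL-based approach maintains or improves general MLLM capabilities.
RoboAlign achieves state-of-the-art performance on embodied reasoning benchmarks while retaining general MLLM capabilities.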
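
The KNN accuracy figure in the findings above suggests a probe in which feature vectors taken from the model are classified by their nearest neighbors. This review does not describe the probing protocol, so the following is only a generic sketch of k-nearest-neighbor accuracy on toy 2D features. The feature values, the labels, the Euclidean metric, and k=3 are all assumptions chosen for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def knn_predict(query, features, labels, k=3):
    """Majority label among the k stored vectors closest to `query` (Euclidean)."""
    ranked = sorted(range(len(features)), key=lambda i: math.dist(query, features[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Share of test vectors whose k-NN prediction equals the true label."""
    correct = sum(knn_predict(x, train_x, train_y, k) == y for x, y in zip(test_x, test_y))
    return correct / len(test_x)


# Toy stand-ins for representation vectors and their labels (hypothetical).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["reach", "reach", "reach", "grasp", "grasp", "grasp"]
test_x = [(0.05, 0.05), (1.05, 1.0)]
test_y = ["reach", "grasp"]

accuracy = knn_accuracy(train_x, train_y, test_x, test_y)
```

A probe of this kind involves no training beyond storing the features, so a higher score indicates that the representation space itself separates the labels more cleanly.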

Figure 2: Overview of RoboAlign framework. RoboAlign directly aligns MLLM representations with low-level action generation using reasoning-incentivized reinforcement learning (guo2025deepseek). The framework consists of two stages: (i) Stage 1 integrates embodied reasoning, zero-shot reasoning,

This review was generated by AI and checked by a human editor.