QUICK REVIEW

[Paper Review] RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

Dongyoung Kim, Sumin Park | arXiv (Cornell University) | 2026. 03. 22.

Multimodal Machine Learning Applications | Citations: 0

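One-Line Summary

RoboAlign trains MLLMs with supervised fine-tuning (SFT) followed by reinforcement learning (RL) so that their zero-shot embodied reasoning aligns with low-level FAST action tokens, and it substantially improves VLAs on LIBERO, CALVIN, and a real robot while using less than 1% additional data.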

ABSTRACT

Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them to readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through supervision of vision-question-answering type. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose a more systematic MLLM training framework, RoboAlign, that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs, and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
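The abstract reports gains "over SFT baselines" as percentages. A gain above 100% on a bounded success metric only makes sense as a relative improvement, so the sketch below assumes that reading for all three numbers. The function name and the example success rates are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical success rates (percent), NOT values from the paper.
gain = relative_improvement(baseline=40.0, improved=60.0)
print(f"{gain:.1f}%")  # prints "50.0%"
```

Under this reading, a relative gain above 100% means the improved model more than doubles the baseline score.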

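Motivation and Objectives

Motivate bridging the language-action modality gap in order to unlock robust embodied reasoning for VLAs.
Propose RoboAlign, a method that generates low-level action tokens through zero-shot reasoning and refines that reasoning with RL.
Show that RL-aligned models outperform SFT-only training and other alignment methods on robotics benchmarks.
Demonstrate transferability across different MLLM backbones and real-robot tasks.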

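Proposed Method

Use SFT to enable the MLLM backbone to generate FAST tokens for low-level action generation.
Attach a diffusion-based action head and train on a dataset mixture that includes RoboAlign VQA and reasoning data.
In the RL loop, apply GRPO with an action-accuracy reward to optimize the accuracy of the low-level action tokens.
In Stage 2, encourage explicit reasoning with <think>...</think> and maximize the format and accuracy rewards.
Evaluate across LIBERO, CALVIN, and real-robot settings, comparing against language-based RL, visual-trajectory RL, and SFT-based baselines.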
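The reward design and the GRPO step listed above can be illustrated with a minimal sketch. It assumes a completion of the form <think>reasoning</think> followed by whitespace-separated integer action-token ids, a position-wise match rate as the accuracy reward, and an unweighted sum of the two rewards. None of these details are confirmed by this review, and every function name is hypothetical. The only GRPO-specific part is the group-relative advantage, which normalizes each reward by the mean and standard deviation of the group sampled for the same prompt.

```python
import re
import statistics

# Assumed output format: "<think>reasoning</think> id id id ..."
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed format, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def parse_action_tokens(completion: str) -> list[int]:
    """Read integer token ids after </think>; empty list if the format is wrong."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return []
    return [int(tok) for tok in match.group(2).split() if tok.isdigit()]


def action_accuracy_reward(predicted: list[int], target: list[int]) -> float:
    """Fraction of target tokens matched position by position (assumed metric)."""
    if not target:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


def total_reward(completion: str, target: list[int]) -> float:
    """Unweighted sum of format and accuracy rewards (the weighting is an assumption)."""
    predicted = parse_action_tokens(completion)
    return format_reward(completion) + action_accuracy_reward(predicted, target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three completions sampled for one prompt (made-up data).
target_tokens = [12, 7, 3]
group = [
    "<think>the gripper is left of the cup</think> 12 7 3",
    "<think>move up first</think> 12 7 9",
    "12 7 3",  # no reasoning block, so it earns no reward here
]
rewards = [total_reward(c, target_tokens) for c in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the fully correct, well-formatted completion receives the largest advantage and the unformatted one the smallest, which is the signal the policy update would then reinforce.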

Figure 1 : Performance on LIBERO. VLAs built upon MLLMs specialized for embodied reasoning (fine-tuned variants of Qwen2.5-VL-7B-Instruct) fail to significantly improve performance and often degrade it compared to the baseline VLA based on the original model. In contrast, RoboAlign achieves signific

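Experimental Results

Research Questions

RQ1: Does RoboAlign consistently improve VLA performance on simulated and real-robot benchmarks?
RQ2: Is RL-based alignment on low-level actions more effective than alignment on high-level language or 2D trajectories?
RQ3: Does RoboAlign preserve or improve general MLLM embodied reasoning and real-world generalization?
RQ4: Does RoboAlign generalize to other MLLM backbones (Qwen2.5VL-7B-Ins, Qwen3VL-8B-Ins, etc.)?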

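Key Results

RoboAlign delivers substantial VLA gains over SFT baselines: 17.5% (LIBERO), 18.9% (CALVIN), and 106.6% (real world), using less than 1% of the data for RL.
RL-based alignment on low-level actions outperforms high-level language RL and 2D-trajectory RL on LIBERO's long-horizon tasks.
RL alignment improves real-robot performance and generalizes across different MLLM backbones.
RoboAlign improves embodied-reasoning representations, as reflected in higher KNN accuracy (69.79% vs. 39.06%).
SFT-based alignment (ECoT) degrades performance, whereas RoboAlign's RL-based approach maintains or improves general MLLM capabilities.
RoboAlign achieves state-of-the-art performance on embodied reasoning benchmarks while preserving general MLLM capabilities.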
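The KNN accuracy figures above are a representation probe: a classifier with no trainable parameters is run on frozen features, so a higher score suggests the features themselves separate the classes better. This review does not describe the paper's exact protocol (which features, which labels, which k), so the sketch below is only one common variant, a leave-one-out k-NN with cosine similarity, run on made-up 2D data. The function name and all values are hypothetical.

```python
import numpy as np


def knn_accuracy(features: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Leave-one-out k-NN accuracy with cosine similarity.

    `features` has shape (n, d) with non-zero rows; `labels` holds
    non-negative integer class ids of shape (n,).
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a sample may not vote for itself
    neighbors = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar
    neighbor_labels = labels[neighbors]  # shape (n, k)
    predictions = np.array([np.bincount(row).argmax() for row in neighbor_labels])
    return float((predictions == labels).mean())


# Toy features: two tight clusters pointing in different directions (made-up data).
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(20, 2))
class1 = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(20, 2))
features = np.vstack([class0, class1])
labels = np.array([0] * 20 + [1] * 20)

accuracy = knn_accuracy(features, labels, k=5)
```

Because the probe has no parameters to fit, a difference in its accuracy between two backbones can be attributed to the features rather than to the classifier.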

Figure 2 : Overview of RoboAlign framework. RoboAlign directly aligns MLLM representations with low-level action generation using reasoning-incentivized reinforcement learning (guo2025deepseek). The framework consists of two stages: (i) Stage 1 integrates embodied reasoning, zero-shot reasoning,


This review was generated by AI and reviewed by a human editor.