QUICK REVIEW

[Paper Review] Scaling World Model for Hierarchical Manipulation Policies

Qian Long, Yueze Wang | arXiv (Cornell University) | February 11, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

This paper presents a hierarchical Vision-Language-Action framework that uses a large-scale pre-trained world model as the high-level planner to generate visually grounded subgoal images. These subgoal images guide a low-level Vision-Language-Action policy and improve its generalization in out-of-distribution scenarios.

ABSTRACT

Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve the generalization bottleneck, we introduce a hierarchical Vision-Language-Action framework, VISTA, that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. Our hierarchical framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in massive out-of-distribution scenarios, and the performance of the same-structured VLA in novel scenarios could boost from 14% to 69% with the guidance generated by the world model. Results demonstrate that our method outperforms previous baselines with a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/

Motivation and Objectives

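Promote robust generalization of vision-language-action (VLA) robot manipulation under limited data and out-of-distribution conditions.
Propose a hierarchical architecture that separates planning (world model) from execution (VLA policy).
Use synthesized goal images as visually and physically grounded subgoals, so that the low-level policy is guided by more than raw text goals.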

Proposed Method

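Introduces a hierarchical Vision-Language-Action framework in which a world model acts as the high-level planner and a VLA policy acts as the low-level executor.
The high-level world model decomposes a task into a sequence of subtasks, each paired with a goal image.
The low-level VLA policy follows the textual and visual guidance to generate action sequences.
The synthesized goal images provide visually and physically grounded detail, which improves generalization to unseen objects and scenarios.
Both visual goal synthesis and the hierarchical policy are evaluated on a large set of out-of-distribution scenarios.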
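The planner/executor split described above can be sketched as a simple control loop. This is only an illustrative sketch and not the authors' implementation: the `Subgoal` type, the `run_hierarchical` function, the callable signatures, and the toy stand-ins are all hypothetical names introduced here, and the "images" are one-element lists so the example stays runnable without any model. It also assumes the planner emits the full subgoal sequence up front, which matches the abstract's wording but may differ from the actual system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]   # toy stand-in for an image (here: a 1-D "scene state")
Action = List[float]  # toy stand-in for a robot action (here: a displacement)


@dataclass
class Subgoal:
    text: str          # textual subtask description
    goal_image: Image  # synthesized goal image for that subtask


def run_hierarchical(
    instruction: str,
    observation: Image,
    planner: Callable[[str, Image], List[Subgoal]],
    policy: Callable[[Image, str, Image], List[Action]],
    step: Callable[[Image, Action], Image],
) -> Tuple[Image, List[Action]]:
    """Plan subgoals once, then let the low-level policy pursue each in order."""
    executed: List[Action] = []
    for subgoal in planner(instruction, observation):
        # The low-level policy sees the current observation plus both the
        # textual and the visual (goal image) guidance.
        for action in policy(observation, subgoal.text, subgoal.goal_image):
            observation = step(observation, action)
            executed.append(action)
    return observation, executed


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_planner(instruction: str, observation: Image) -> List[Subgoal]:
    # Pretend the world model proposes two intermediate goal images.
    return [Subgoal("reach halfway", [0.5]), Subgoal("reach target", [1.0])]


def toy_policy(observation: Image, text: str, goal_image: Image) -> List[Action]:
    # Pretend the VLA outputs one action that closes the gap to the goal image.
    return [[g - o for g, o in zip(goal_image, observation)]]


def toy_step(observation: Image, action: Action) -> Image:
    # Pretend environment: the action is added to the current state.
    return [o + a for o, a in zip(observation, action)]
```

Calling `run_hierarchical("move block to target", [0.0], toy_planner, toy_policy, toy_step)` walks the toy state through both subgoals. In a real system the three callables would wrap the world model, the VLA policy, and the robot or simulator.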

Experimental Results

Research Questions

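RQ1: Can a hierarchical VLA framework improve generalization to out-of-distribution scenarios in manipulation tasks?
RQ2: Are synthesized goal images, used as visually and physically grounded subgoals, better than raw text goals for guiding the low-level policy?
RQ3: How much can world-model-based subgoal synthesis raise the performance of a low-level VLA on unseen objects and scenarios?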

Key Results

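A VLA policy of the same structure, when guided by subgoals synthesized by the world model, shows substantial gains over the baselines even in novel scenarios.
With world-model guidance, the performance of the same-structured VLA in novel (OOD) scenarios rises from 14% to 69%.
The proposed method outperforms previous baselines by a clear margin, particularly under OOD conditions.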
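To put the reported 14% and 69% figures in perspective, the short calculation below derives the absolute and relative improvement. Only the two percentages come from the abstract. The helper name `improvement` is hypothetical, and the derived numbers are plain arithmetic rather than results reported by the authors.

```python
from typing import Tuple


def improvement(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, multiplicative factor)."""
    absolute_gain = after_pct - before_pct
    relative_factor = after_pct / before_pct
    return absolute_gain, relative_factor


# Figures quoted in the abstract for the same-structured VLA in novel scenarios.
gain, factor = improvement(14.0, 69.0)
print(f"+{gain:.0f} percentage points, about {factor:.1f}x the unguided result")
```

That is a gain of 55 percentage points, or roughly 4.9 times the performance without world-model guidance.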


This review was generated by AI and reviewed by a human editor.