QUICK REVIEW

[Paper Review] Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

Songen Gu, Yunuo Cai|arXiv (Cornell University)|2026. 02. 11.

Multimodal Machine Learning Applications | Citations: 0

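One-Line Summary

This paper presents Dream4manip, a three-stage framework that selects and adapts a high-fidelity video world model, distills it for fast few-step inference, and uses an in-context conditioned action model to achieve better embodiment consistency, spatial accuracy, and task success in long-horizon, instruction-driven robot manipulation.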

ABSTRACT

Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Codes & Models will be released.

Research Motivation and Goals

Build robust predictive world models for long-horizon robotic manipulation that anticipate how the environment evolves.
Combine high-fidelity video prediction with fast inference suitable for real-time control.
Ground imagined future trajectories in real observations via an in-context conditioned action model.

Proposed Method

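Selects Cosmos-Predict2 as the world-model backbone based on embodiment consistency and task metrics.
Applies domain adaptation and latent-space adversarial distillation to enable few-step denoising while preserving fidelity.
Introduces length-agnostic imagination by compressing trajectories into a fixed set of keyframes and fine-tuning the video diffusion model.
Develops an in-context conditioned action model that uses both imagined trajectories and real observations to correct spatial errors and generate executable actions.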
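
The keyframe compression step can be illustrated with a minimal sketch. The review does not say how keyframes are chosen, so the uniform spacing used here is an assumption, and the names `select_keyframe_indices` and `compress_trajectory` are hypothetical. The point is only that trajectories of any length map to the same fixed number of frames.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def select_keyframe_indices(num_frames: int, num_keyframes: int) -> List[int]:
    """Return `num_keyframes` evenly spaced frame indices (assumed selection rule).

    The first and last frames are always included when num_keyframes >= 2.
    If the trajectory is shorter than the keyframe budget, indices repeat,
    so the output length is always exactly `num_keyframes`.
    """
    if num_frames <= 0 or num_keyframes <= 0:
        raise ValueError("num_frames and num_keyframes must be positive")
    if num_keyframes == 1:
        return [num_frames - 1]
    last = num_frames - 1
    denom = num_keyframes - 1
    # Integer arithmetic with round-half-up avoids floating-point surprises.
    return [(i * last + denom // 2) // denom for i in range(num_keyframes)]


def compress_trajectory(frames: Sequence[T], num_keyframes: int) -> List[T]:
    """Compress a variable-length trajectory to a fixed-size keyframe list."""
    indices = select_keyframe_indices(len(frames), num_keyframes)
    return [frames[i] for i in indices]


short_clip = list(range(10))    # stand-in for 10 video frames
long_clip = list(range(100))    # stand-in for 100 video frames
print(compress_trajectory(short_clip, 4))  # [0, 3, 6, 9]
print(compress_trajectory(long_clip, 4))   # [0, 33, 66, 99]
```

Both clips yield four keyframes, so a model fine-tuned on such targets sees the same output shape whatever the task duration.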

Figure 1: (a) The training pipeline for world model distillation. Additional details are provided in Section 3.1.2. (b) The pipeline of our proposed policy model. Given the current observation and the instruction, the world model first generates imagined future frames. The in-context conditioned a

Experimental Results

Research Questions

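RQ1: Can a high-fidelity video world model be adapted to the robot domain and distilled for fast inference without degrading spatial and temporal fidelity?
RQ2: Does length-agnostic imagination enable reliable long-horizon prediction in manipulation tasks?
RQ3: Can the in-context conditioned action model ground imagined futures in real observations, improving action quality and task success?
RQ4: What benefits does integrating a predictive world model with an in-context action policy bring for embodiment consistency, spatial referring ability, and task completion?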

Key Results

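Dream4manip achieves state-of-the-art performance on long-horizon manipulation in the LIBERO benchmark.
Dream4manip reaches a 98.2% overall success rate on LIBERO and generalizes strongly across suites: 99.4% on LIBERO-Spatial, 99.2% on LIBERO-Object, 98.6% on LIBERO-Goal, and 95.4% on LIBERO-Long.
Domain adaptation and latent adversarial distillation improve video quality and enable efficient few-step inference (C-2B-DA+Dis: SSIM 0.84, PSNR 26.82, FVD 238.09).
Zero-shot spatial referring ability is strongest for the Cosmos models, and domain-adapted Cosmos achieves robust embodiment consistency across diverse tasks.
The Cosmos-2B/14B variants before domain adaptation are better at spatial referring and manipulation, while the domain-adapted versions improve temporal consistency and fidelity.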
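
Two of the reported numbers can be unpacked with a short sketch. The first part checks that the 98.2% overall LIBERO figure matches the unweighted mean of the four suite scores. That reading of "overall" is an assumption, since the review does not say how the figure is aggregated. The second part shows the standard PSNR definition, 10·log10(peak² / MSE). The 8-bit peak of 255 is also an assumption, because the review does not give the pixel range used in the paper's evaluation.

```python
import math
from decimal import Decimal, ROUND_HALF_UP
from typing import Iterable, Sequence

# Per-suite success rates (%) as quoted in the results above.
suite_success = {
    "LIBERO-Spatial": Decimal("99.4"),
    "LIBERO-Object": Decimal("99.2"),
    "LIBERO-Goal": Decimal("98.6"),
    "LIBERO-Long": Decimal("95.4"),
}


def mean_success(rates: Iterable[Decimal]) -> Decimal:
    """Unweighted mean of suite success rates (assumed aggregation rule)."""
    values = list(rates)
    return sum(values, Decimal("0")) / len(values)


def psnr(reference: Sequence[float], predicted: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB over flattened pixel values.

    `peak` is the maximum possible pixel value (255 assumed for 8-bit images).
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


overall = mean_success(suite_success.values())
overall_rounded = overall.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(overall, overall_rounded)  # 98.15 98.2
```

Using `Decimal` keeps the mean exact, so the half-up rounding to one decimal place does not depend on binary floating-point representation.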

Figure 2: The structure of the in-context conditioned action model. We use a transformer-based backbone inherited from ACT (Zhao et al., 2023) for our action model; a separate vision encoder is assembled to process videos and observations. The model will output an action chunk for each observation
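
The control flow described in the two figure captions (imagine future frames first, then emit an action chunk for each real observation) can be sketched as a loop. This is one reading of the captions under stated assumptions, not the authors' implementation. `run_episode`, `ToyEnv`, `toy_world_model`, and `toy_action_model` are hypothetical stand-ins, and a 1-D position replaces images so the example stays runnable.

```python
from typing import Callable, List


def run_episode(instruction: str,
                observe: Callable[[], float],
                step: Callable[[float], None],
                world_model: Callable[[float, str], List[float]],
                action_model: Callable[[List[float], float], List[float]],
                max_chunks: int = 50) -> float:
    """Imagine once, then act chunk by chunk on imagined frames plus real obs."""
    obs = observe()
    imagined = world_model(obs, instruction)   # imagined future "frames"
    for _ in range(max_chunks):
        chunk = action_model(imagined, obs)    # conditioned on both inputs
        if not chunk:                          # empty chunk: nothing left to do
            break
        for action in chunk:
            step(action)
        obs = observe()                        # re-ground in the real observation
    return obs


class ToyEnv:
    """Toy 1-D environment: the observation is just the current position."""

    def __init__(self, position: float = 0.0) -> None:
        self.position = position

    def observe(self) -> float:
        return self.position

    def step(self, action: float) -> None:
        self.position += action


def toy_world_model(obs: float, instruction: str, num_keyframes: int = 4) -> List[float]:
    """Toy stand-in: read the target from the instruction, interpolate keyframes."""
    target = float(instruction.split()[-1])
    return [obs + (target - obs) * j / num_keyframes
            for j in range(1, num_keyframes + 1)]


def toy_action_model(imagined: List[float], obs: float, chunk_size: int = 3,
                     max_step: float = 1.0, tol: float = 1e-6) -> List[float]:
    """Toy stand-in: move from the real observation toward the last imagined frame."""
    goal, pos, chunk = imagined[-1], obs, []
    for _ in range(chunk_size):
        delta = goal - pos
        if abs(delta) <= tol:
            break
        action = max(-max_step, min(max_step, delta))  # clip each step
        chunk.append(action)
        pos += action
    return chunk


env = ToyEnv(0.0)
final_position = run_episode("move to 5", env.observe, env.step,
                             toy_world_model, toy_action_model)
print(final_position)  # 5.0
```

Because each chunk is recomputed from the latest real observation, drift between the imagined frames and the actual state is corrected at chunk boundaries. That is the role the review attributes to the in-context conditioned action model.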


This review was generated by AI and reviewed by a human editor.