QUICK REVIEW

[Paper Review] Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

Songen Gu, Yunuo Cai|arXiv (Cornell University)|Feb 11, 2026

Multimodal Machine Learning Applications | Citations: 0

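One-Line Summary

The paper presents Dream4manip, a three-stage framework. It selects and adapts a high-fidelity video world model, distills it for fast inference, and uses an in-context conditioned action model to carry out long-horizon, instruction-driven robot manipulation. The stated goals are better embodiment consistency, spatial accuracy, and task success.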

ABSTRACT

Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs) provide high-level guidance, they cannot explicitly forecast future states, and existing world models either predict only short horizons or produce spatially inconsistent frames. To address these challenges, we propose a framework for fast and predictive video-conditioned action. Our approach first selects and adapts a robust video generation model to ensure reliable future predictions, then applies adversarial distillation for fast, few-step video generation, and finally trains an action model that leverages both generated videos and real observations to correct spatial errors. Extensive experiments show that our method produces temporally coherent, spatially accurate video predictions that directly support precise manipulation, achieving significant improvements in embodiment consistency, spatial referring ability, and task completion over existing baselines. Codes & Models will be released.

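Motivation and Objectives

Motivate a robust predictive world model for long-horizon robot manipulation, one that anticipates how the environment changes.
Combine high-fidelity video prediction with inference fast enough for real-time control.
Ground imagined future trajectories in real-world observations through an in-context conditioned action model, correcting spatial errors and producing executable actions.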

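Proposed Method

Select Cosmos-Predict2 as the world-model backbone, based on embodiment consistency and task metrics.
Apply domain adaptation and latent-space adversarial distillation so that denoising needs only a few steps while fidelity is preserved.
Introduce length-agnostic imagination by compressing each trajectory into a fixed set of keyframes, and fine-tune the video diffusion model on them.
Develop an in-context conditioned action model that uses both imagined trajectories and real observations to correct spatial errors and generate executable actions.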
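
The keyframe compression step can be illustrated with a minimal sketch. The review does not say how the paper picks its keyframes, so uniform index spacing is an assumption here, and `sample_keyframes` and `num_keyframes` are hypothetical names chosen for illustration. The point is only that trajectories of any length map to the same fixed number of frames.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sample_keyframes(trajectory: Sequence[Frame], num_keyframes: int) -> List[Frame]:
    """Compress a trajectory of any length to exactly `num_keyframes` frames.

    Assumption: indices are spread evenly from the first to the last frame.
    Trajectories shorter than `num_keyframes` repeat some frames.
    """
    if num_keyframes < 1:
        raise ValueError("num_keyframes must be at least 1")
    if len(trajectory) == 0:
        raise ValueError("trajectory must not be empty")
    if num_keyframes == 1:
        # With a single slot, keep the final frame of the trajectory.
        return [trajectory[-1]]
    last = len(trajectory) - 1
    # Integer arithmetic keeps the index choice deterministic.
    indices = [(i * last) // (num_keyframes - 1) for i in range(num_keyframes)]
    return [trajectory[i] for i in indices]
```

For example, a 100-frame and a 37-frame demonstration both come out as lists of the same length, which is what lets a video model be trained on a fixed-size target regardless of episode duration.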

Figure 1: (a) The training pipeline for world model distillation. Additional details are provided in Section 3.1.2. (b) The pipeline of our proposed policy model. Given the current observation and the instruction, the world model first generates imagined future frames. The in-context conditioned a
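
The inference flow described in the caption can be sketched as a small control loop. This is a sketch under stated assumptions, not the authors' implementation: `run_episode` and every callable it takes are hypothetical stand-ins, and it assumes the world model is queried once per episode while the action model is queried once per observation. The source does not specify how often the imagined video is regenerated.

```python
from typing import Callable, List, Sequence

Observation = List[float]  # stand-in for a camera image
Frame = List[float]        # stand-in for a generated video frame
Action = List[float]       # stand-in for a robot command


def run_episode(
    instruction: str,
    observe: Callable[[], Observation],
    imagine: Callable[[Observation, str], Sequence[Frame]],
    act: Callable[[Sequence[Frame], Observation], Sequence[Action]],
    execute: Callable[[Action], None],
    is_done: Callable[[], bool],
    max_chunks: int = 50,
) -> int:
    """Imagine the future once, then act on it chunk by chunk.

    Returns the number of action chunks that were executed.
    """
    # "Dream": the world model predicts future frames from the first
    # observation and the language instruction (assumed: once per episode).
    imagined = imagine(observe(), instruction)
    chunks = 0
    while chunks < max_chunks and not is_done():
        # "Act": the action model sees the imagined frames AND the latest
        # real observation, so it can correct errors in the imagined video.
        observation = observe()
        for action in act(imagined, observation):
            execute(action)
        chunks += 1
    return chunks
```

Passing the models in as callables keeps the sketch independent of any particular video generator or policy network; in a real system `imagine` would wrap the distilled world model and `act` the in-context conditioned action model.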

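Experimental Results

Research Questions

RQ1: Can a high-fidelity video world model be adapted to the robot domain and distilled for fast inference without sacrificing spatial and temporal fidelity?
RQ2: Can length-agnostic imagination make long-horizon prediction reliable for manipulation tasks?
RQ3: Can an in-context conditioned action model ground imagined futures in real observations to improve action quality and task success?
RQ4: How large are the gains in embodiment consistency, spatial referring ability, and task completion when a predictive world model is integrated with an in-context action policy?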

Key Findings

Model	EC	RSR	ISR	TSR
Cosmos-2B	1.58	96.00	86.00	70.00
Cosmos-14B	1.62	90.00	82.00	76.00
Cosmos-14B-Droid	1.58	78.00	58.00	34.00
Wan-14B	1.56	54.00	44.00	20.00

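Dream4manip achieves state-of-the-art performance on long-horizon manipulation in the LIBERO benchmark.
Dream4manip reaches an overall LIBERO success rate of 98.2% and shows strong generalization: 99.4% on LIBERO-Spatial, 99.2% on LIBERO-Object, 98.6% on LIBERO-Goal, and 95.4% on LIBERO-Long.
Domain adaptation and latent adversarial distillation improve video quality and make few-step inference efficient (C-2B-DA+Dis: SSIM 0.84, PSNR 26.82, FVD 238.09).
Zero-shot spatial referring ability is strongest in the Cosmos models, and the domain-adapted Cosmos achieves robust embodiment consistency across tasks.
Before adaptation, the Cosmos-2B/14B family already performs well at spatial referring and manipulation; the domain-adapted versions improve temporal consistency and fidelity.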
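
As a quick consistency check on the LIBERO numbers above, the four per-suite success rates can be averaged. The snippet assumes the overall figure is the unweighted mean of the four suites, which the review does not state explicitly; `suite_success` and `mean_success` are names introduced here.

```python
# Per-suite success rates (percent) as reported in the findings.
suite_success = {
    "LIBERO-Spatial": 99.4,
    "LIBERO-Object": 99.2,
    "LIBERO-Goal": 98.6,
    "LIBERO-Long": 95.4,
}

# Unweighted mean across the four suites (assumed aggregation).
mean_success = sum(suite_success.values()) / len(suite_success)
print(f"mean success rate: {mean_success:.2f}%")
```

The mean is 392.6 / 4 = 98.15, which agrees with the reported 98.2% after rounding to one decimal place. LIBERO-Long is the only suite below the mean, so the long-horizon tasks remain the hardest of the four.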

Figure 2: The structure of the in-context conditioned action model. We use a transformer-based backbone inherited from ACT (Zhao et al., 2023) for our action model; a separate vision encoder is assembled to process videos and observations. The model will output an action chunk for each observation

This review was generated by AI and checked by a human editor.