QUICK REVIEW

[Paper Review] Scaling World Model for Hierarchical Manipulation Policies

Qian Long, Yueze Wang | arXiv (Cornell University) | Feb 11, 2026

Multimodal Machine Learning Applications | Citations: 0

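One-Line Summary

This paper proposes a hierarchical Vision-Language-Action framework that uses a large-scale pre-trained world model as a high-level planner to generate visually grounded subgoal images, improving the generalization of the low-level Vision-Language-Action policy in out-of-distribution (OOD) settings.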

ABSTRACT

Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve the generalization bottleneck, we introduce a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition (VISTA). Our hierarchical framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in massive out-of-distribution scenarios, and the performance of the same-structured VLA in novel scenarios improves from 14% to 69% with the guidance generated by the world model. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/

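Motivation and Objectives

Achieve robust generalization for Vision-Language-Action (VLA) robot manipulation when data is scarce and conditions are OOD.
Propose a hierarchical architecture that separates planning (the world model) from execution (the VLA policy).
Use synthesized goal images as visually and physically grounded subgoals, giving the low-level policy richer guidance than raw textual goals.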

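Proposed Method

Introduces a hierarchical Vision-Language-Action framework in which a world model acts as the high-level planner and a VLA policy acts as the low-level executor.
The high-level world model decomposes a task into a sequence of subtasks, each with a goal image as its target.
The low-level VLA policy follows the textual and visual guidance to generate action sequences.
The synthesized goal images provide visually and physically grounded details, improving generalization to unseen objects and scenarios.
Both visual goal synthesis and the hierarchical policy are evaluated across a large number of OOD scenarios.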
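
The planner/executor split described above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the review does not specify any interfaces, so every name here (`Subgoal`, `run_hierarchical_policy`, `plan`, `act`, `step`) is hypothetical, the image type is a placeholder nested list, and the sketch assumes the world model plans all subgoals once up front, as the abstract's wording suggests. The `toy_*` functions are stand-ins that only make the loop runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Placeholder types (assumptions): a real system would use image tensors
# and robot-specific action vectors.
Image = List[List[float]]
Action = List[float]


@dataclass
class Subgoal:
    instruction: str   # textual description of the subtask
    goal_image: Image  # synthesized goal image for the subtask


def run_hierarchical_policy(
    task: str,
    observation: Image,
    plan: Callable[[str, Image], Sequence[Subgoal]],       # high-level world model (stand-in)
    act: Callable[[Image, str, Image], Sequence[Action]],  # low-level VLA policy (stand-in)
    step: Callable[[Image, Action], Image],                # environment transition (stand-in)
) -> List[Action]:
    """Plan subgoals once, then execute each subgoal with the low-level policy."""
    executed: List[Action] = []
    for subgoal in plan(task, observation):
        # The low-level policy sees the current observation plus both the
        # textual and the visual guidance for this subtask.
        for action in act(observation, subgoal.instruction, subgoal.goal_image):
            observation = step(observation, action)
            executed.append(action)
    return executed


# Toy stand-ins on a 1x1 "image" so the loop can be exercised end to end.
def toy_plan(task: str, observation: Image) -> List[Subgoal]:
    return [Subgoal("reach 1", [[1.0]]), Subgoal("reach 3", [[3.0]])]


def toy_act(observation: Image, instruction: str, goal_image: Image) -> List[Action]:
    # One action that moves the single pixel value to the goal value.
    return [[goal_image[0][0] - observation[0][0]]]


def toy_step(observation: Image, action: Action) -> Image:
    return [[observation[0][0] + action[0]]]


actions = run_hierarchical_policy("toy task", [[0.0]], toy_plan, toy_act, toy_step)
```

Passing the planner, policy, and environment as plain callables keeps the sketch independent of any particular model library; swapping in a replanning variant would only change where `plan` is called.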

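Experimental Results

Research Questions

RQ1: Can a hierarchical VLA framework improve generalization in OOD settings for manipulation tasks?
RQ2: Are synthesized goal images, used as visually and physically grounded subgoals, better than raw textual goals at guiding the low-level policy?
RQ3: How much does world-model-driven subgoal synthesis improve low-level VLA performance on unseen objects and scenarios?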

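Key Findings

In unseen scenarios, a VLA policy with the same architecture shows a marked performance gain over the baseline when guided by subgoals synthesized by the world model.
With world-model guidance, performance in OOD scenarios improves from 14% to 69%.
The proposed method outperforms prior baselines by a clear margin, particularly under OOD conditions.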
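
For scale, the reported jump from 14% to 69% can be expressed as both an absolute and a relative change. The short calculation below uses only those two reported figures; the variable names are illustrative.

```python
# Reported performance of the same-structured VLA in novel scenarios (percent).
baseline_pct = 14  # without world-model guidance
guided_pct = 69    # with world-model guidance

absolute_gain_pp = guided_pct - baseline_pct  # gain in percentage points
relative_factor = guided_pct / baseline_pct   # multiplicative improvement

print(f"Absolute gain: {absolute_gain_pp} percentage points")
print(f"Relative factor: {relative_factor:.2f}x")
```

That is a gain of 55 percentage points, or roughly a 4.9-fold improvement over the unguided policy.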

This review was generated by AI and checked by a human editor.