QUICK REVIEW

[Paper Review] Scaling World Model for Hierarchical Manipulation Policies

Qian Long, Yueze Wang | arXiv (Cornell University) | Feb 11, 2026

Multimodal Machine Learning Applications | Cited by: 0

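One-Sentence Summary

The paper proposes a hierarchical Vision-Language-Action (VLA) framework in which a large-scale pre-trained world model serves as the high-level planner, generating visually grounded subgoal images that guide a low-level VLA policy and improve generalization in out-of-distribution (OOD) scenarios.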

ABSTRACT

Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve the generalization bottleneck, we introduce VISTA, a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable VIsual Subgoal TAsk decomposition. Our hierarchical framework consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level policies, making it feasible to generalize across unseen objects and novel scenarios. We validate both visual goal synthesis and our hierarchical VLA policies in extensive out-of-distribution scenarios, and the performance of a same-structured VLA in novel scenarios improves from 14% to 69% with the guidance generated by the world model. Results demonstrate that our method outperforms previous baselines by a clear margin, particularly in out-of-distribution scenarios. Project page: https://vista-wm.github.io/

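Research Motivation and Objectives

Achieve robust generalization for Vision-Language-Action (VLA) robot manipulation under data scarcity and out-of-distribution conditions.
Propose a hierarchical architecture that decouples planning (world model) from execution (VLA policy).
Use synthesized goal images as visually and physically grounded subgoals, giving the low-level policy guidance beyond what raw textual goals can specify.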

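Proposed Method

A hierarchical Vision-Language-Action framework in which a world model acts as the high-level planner and a VLA policy acts as the low-level executor.
The high-level world model decomposes a task into a sequence of subtasks, each specified by a goal image.
The low-level VLA policy follows the textual and visual guidance to generate action sequences.
The synthesized goal images supply visually and physically grounded details, improving generalization to unseen objects and scenarios.
Both visual goal synthesis and the hierarchical policy are evaluated across a large number of out-of-distribution scenarios.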
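
The planner/executor split described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Subgoal` type, the `plan`, `policy`, and `step` interfaces, and the one-number "image" are all hypothetical stand-ins chosen so the example runs on its own. It assumes the world model decomposes the task once up front, which matches the abstract's wording but may not match the actual system.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a real system would use image tensors and robot actions.
Image = List[float]
Action = List[float]


@dataclass
class Subgoal:
    text: str          # textual description of the subtask
    goal_image: Image  # synthesized image of the desired end state


def run_episode(
    task: str,
    obs: Image,
    plan: Callable[[str, Image], List[Subgoal]],       # high-level world model (assumed interface)
    policy: Callable[[Image, Subgoal], List[Action]],  # low-level VLA policy (assumed interface)
    step: Callable[[Image, Action], Image],            # environment transition (assumed interface)
) -> Image:
    """Decompose the task once, then execute each subgoal in order."""
    for subgoal in plan(task, obs):
        # The policy receives both the text and the goal image of the current subtask.
        for action in policy(obs, subgoal):
            obs = step(obs, action)
    return obs


# Toy stubs: the "image" is a single number and each subgoal moves it up by 1.
def toy_plan(task: str, obs: Image) -> List[Subgoal]:
    parts = [p.strip() for p in task.split(" then ")]
    return [Subgoal(text=p, goal_image=[obs[0] + i + 1.0]) for i, p in enumerate(parts)]


def toy_policy(obs: Image, subgoal: Subgoal) -> List[Action]:
    # One action that closes the gap between the observation and the goal image.
    return [[subgoal.goal_image[0] - obs[0]]]


def toy_step(obs: Image, action: Action) -> Image:
    return [obs[0] + action[0]]


final_obs = run_episode(
    "pick up the cup then place it on the shelf",
    [0.0],
    toy_plan,
    toy_policy,
    toy_step,
)
print(final_obs)
```

The point of the sketch is the interface boundary: the low-level policy never sees the full task, only the current subgoal's text and goal image, so a better planner can be swapped in without changing the executor.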

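Experimental Results

Research Questions

RQ1: Can a hierarchical VLA framework improve generalization in out-of-distribution manipulation scenarios?
RQ2: Does guiding the low-level policy with synthesized, visually and physically grounded subgoal images outperform using raw textual goals alone?
RQ3: How much does world-model-guided subgoal synthesis improve low-level VLA performance on unseen objects and scenarios?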

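Key Findings

Guided by subgoals synthesized by the world model, a VLA policy of the same structure improves substantially over the baseline in novel scenarios.
With world-model guidance, performance in out-of-distribution scenarios rises from 14% to 69%.
The proposed method outperforms previous baselines by a clear margin under OOD conditions.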
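
To put the reported 14% and 69% figures in perspective, the short calculation below works out the absolute and relative change. Only the two percentages come from the paper; the variable names are illustrative, and the summary does not state which metric the percentages measure.

```python
# Reported performance of the same-structured VLA in novel scenarios (from the abstract).
baseline_pct = 14  # without world-model guidance
guided_pct = 69    # with world-model guidance

# Absolute improvement in percentage points.
absolute_gain_pp = guided_pct - baseline_pct

# Relative improvement as a multiple of the baseline.
relative_factor = guided_pct / baseline_pct

print(f"absolute gain: {absolute_gain_pp} percentage points")
print(f"relative factor: {relative_factor:.2f}x")
```

The gain is 55 percentage points, or roughly 4.9 times the baseline figure.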

This review was generated by AI and checked by a human editor.