QUICK REVIEW

[Paper Review] WorldCompass: Reinforcement Learning for Long-Horizon World Models

Zehan Wang, Tengfei Wang | arXiv (Cornell University) | Feb 9, 2026

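One-Sentence Summary

WorldCompass is an RL post-training framework that achieves significant gains on WorldPlay through clip-level rollouts, complementary rewards for action following and visual quality, and negative-aware fine-tuning.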

ABSTRACT

This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for the long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-level rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ the negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.

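Motivation and Objectives

Advance post-training of video-based world models to improve long-horizon interaction fidelity.
Develop an RL framework tailored to autoregressive video generation and interaction signals.
Improve exploration efficiency and reward-signal granularity for long sequences.
Mitigate reward hacking while balancing action-following accuracy and visual quality.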

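Proposed Method

Introduces clip-level rollouts: multiple rollouts of a single target clip are generated and evaluated while the shared prefix is reused.
Designs two complementary reward functions: interaction-following accuracy and visual quality (HPSv3).
Uses a negative-aware fine-tuning RL algorithm, with efficiency optimizations for diffusion-based video models.
Adopts Best-of-N sampling and a curriculum-like progressive target-clip index to stabilize training.
Avoids KL regularization; relies on EMA updates and a low learning rate for stable optimization.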
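
The method steps above can be illustrated with a small, library-free sketch. This is not the authors' implementation: the toy generator, the weighted-sum reward, the top-k/bottom-k split, and every function name below are hypothetical stand-ins chosen for illustration. The summary does not say how the two rewards are combined or how Best-of-N is applied. The sketch only shows the general shape: several candidates are sampled for one target clip from a single shared prefix, scored with two rewards, ranked, and an EMA copy of the parameters is kept.

```python
import random
from typing import Callable, List, Sequence, Tuple

Clip = List[float]  # toy stand-in for a generated video clip


def toy_generate_clip(prefix: Sequence[Clip], rng: random.Random) -> Clip:
    """Hypothetical generator: continue from the last prefix clip with noise."""
    return [prefix[-1][0] + rng.gauss(0.0, 1.0)]


def clip_level_rollout(
    prefix: Sequence[Clip],
    generate_clip: Callable[[Sequence[Clip], random.Random], Clip],
    num_samples: int,
    rng: random.Random,
) -> List[Clip]:
    """Sample several candidates for ONE target clip.

    The prefix is generated once by the caller and reused for every
    candidate, so only the target clip is re-sampled.
    """
    return [generate_clip(prefix, rng) for _ in range(num_samples)]


def combined_reward(
    action_score: float, quality_score: float, w_action: float = 0.5
) -> float:
    """Assumed combination rule: a convex weighted sum of the two rewards."""
    return w_action * action_score + (1.0 - w_action) * quality_score


def select_best_and_worst(
    candidates: Sequence, rewards: Sequence[float], k: int
) -> Tuple[list, list]:
    """Rank candidates by reward and return the top-k and bottom-k.

    A generic Best-of-N style split; the bottom-k group stands in for the
    negative samples that a negative-aware objective would push away from.
    """
    order = sorted(range(len(candidates)), key=lambda i: rewards[i], reverse=True)
    best = [candidates[i] for i in order[:k]]
    worst = [candidates[i] for i in order[-k:]]
    return best, worst


def ema_update(
    ema_params: Sequence[float], params: Sequence[float], decay: float = 0.99
) -> List[float]:
    """Exponential moving average of parameters: ema = decay*ema + (1-decay)*p."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]
```

In a real pipeline the scores would come from an action-following estimator and a visual-quality model such as HPSv3, and the ranked groups would feed the fine-tuning loss. Here they are plain numbers so the control flow stays visible.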

Experimental Results

Research Questions

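RQ1: Can RL post-training improve autoregressive, interactive, long-horizon world models over the pretrained baseline?
RQ2: Do clip-level rollouts and dual rewards provide finer-grained, more informative feedback than sequence-level rewards?
RQ3: How can diffusion-based world models be trained efficiently with RL without overfitting or reward hacking?
RQ4: Does WorldCompass generalize across WorldPlay variants and levels of action complexity?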

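Key Findings

WorldCompass significantly improves interaction accuracy over short, medium, and long horizons, for both basic and compound actions.
For complex compound actions, WorldCompass raises action accuracy from ~20% to ~55%.
For basic actions, action accuracy improves by about 10 percentage points.
Visual quality (HPSv3) also improves with WorldCompass, indicating better fidelity and prompt alignment.
Clip-level rollouts outperform sample-level rollouts on both action following and visual quality.
Efficiency strategies (Best-of-N, timestep subsampling, progressive clip length) cut training time by up to ~50% without sacrificing performance.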
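
The findings mix two ways of stating a gain: a before/after pair (~20% to ~55%) and a difference in percentage points (about 10). The short sketch below makes the distinction explicit. The helper names are hypothetical, and the inputs are the approximate figures quoted above, so the outputs are only as precise as those figures.

```python
def absolute_gain_pp(before_pct: float, after_pct: float) -> float:
    """Gain in percentage points: simple difference of the two rates."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Gain as a multiplicative factor relative to the starting rate."""
    return after_pct / before_pct


# Compound actions, using the approximate figures from the findings.
compound_pp = absolute_gain_pp(20.0, 55.0)    # 35.0 percentage points
compound_factor = relative_gain(20.0, 55.0)   # 2.75 times the baseline
```

Read this way, the compound-action result is roughly a 35-point absolute gain, against a gain of about 10 points for basic actions.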

This review was generated by AI and checked by a human editor.