QUICK REVIEW

[Paper Review] World-Gymnast: Training Robots with Reinforcement Learning in a World Model

Ansh Kumar Sharma, Yixiang Sun|arXiv (Cornell University)|Feb 2, 2026
Reinforcement Learning in Robotics | Cited by 0
One-Sentence Summary

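World-Gymnast fine-tunes a vision-language-action (VLA) policy inside an action-conditioned video world model, with rewards from a vision-language model (VLM). It outperforms both SFT and software-simulator RL on real-robot performance, and it supports test-time training and iterative improvement.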

ABSTRACT

Robot learning from interacting with the physical world is fundamentally bottlenecked by the cost of physical interaction. The two alternatives, supervised finetuning (SFT) from expert demonstrations and reinforcement learning (RL) in a software-based simulator, are limited by the amount of expert data available and the sim-to-real gap for manipulation. With the recent emergence of world models learned from real-world video-action data, we ask the question of whether training a policy in a world model can be more effective than supervised learning or software simulation in achieving better real-robot performance. We propose World-Gymnast, which performs RL finetuning of a vision-language-action (VLA) policy by rolling out the policy in an action-conditioned video world model and rewarding the rollouts with a vision-language model (VLM). On the Bridge robot setup, World-Gymnast outperforms SFT by as much as 18x and outperforms software simulator by as much as 2x. More importantly, World-Gymnast demonstrates intriguing capabilities of RL with a world model, including training on diverse language instructions and novel scenes from the world model, test-time training in a novel scene, and online iterative world model and policy improvement. Our results suggest learning a world model and training robot policies in the cloud could be the key to bridging the gap between robots that work in demonstrations and robots that can work in anyone's household.

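Motivation and Objectives

  • Reduce the cost of real-robot data for policy learning by using a world model learned from real-world data.
  • Show that RL fine-tuning in a world model yields better real-world performance than SFT or a conventional simulator.
  • Enable training from arbitrary initial frames and novel language instructions, along with test-time and iterative improvement of the world model and policy.
  • Demonstrate the system on Bridge robot tasks, with real-robot evaluation through AutoEval.
  • Explore data augmentation and scalability through distractors, novel language prompts, and additional tasks.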

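Proposed Method

  • Use World-Gymnast to fine-tune a vision-language-action policy with RL inside an action-conditioned world model called WorldGym.
  • Roll out imagined trajectories in the world model with the current policy, sampling actions from the policy.
  • Compute a binary task reward by applying a vision-language model (VLM) to the predicted frames.
  • Estimate advantages with group normalization (GRPO) and optimize a PPO-style clipped objective.
  • Include diverse training scenarios to improve robustness: arbitrary initial frames, novel language instructions, and distractor objects.
  • Optionally update the world model and the policy iteratively online (Dyna-style), using real-robot data to improve rollouts.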
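The advantage and objective steps in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the binary VLM rewards for one group of imagined rollouts (same initial frame and instruction) are already available, and the function names, `eps`, and `clip_eps=0.2` are illustrative choices rather than values taken from the paper.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group.

    `rewards` holds one scalar per rollout in the group (here: binary
    success labels, assumed to come from a VLM judge).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples.

    `logp_new` / `logp_old` are log-probabilities of the sampled actions
    under the current policy and the policy that generated the rollouts.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical group of 4 imagined rollouts: 1.0 = judged success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_normalized_advantages(rewards)
```

Successful rollouts receive positive advantages and failed ones negative, so no learned value function is needed. A group whose rollouts all succeed or all fail yields zero advantages and therefore contributes no gradient signal in this sketch.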
Figure 1: Overview of World-Gymnast. The policy is trained on tasks specified by an initial frame and language instruction. During training, the policy outputs actions which are then passed to the world model (WorldGym (Quevedo et al., 2025)) which generates imagined rollouts. These rollouts are

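Experimental Results

Research Questions

  • RQ1: Do policies trained in a learned world model perform better on a real robot than SFT or RL in a software simulator?
  • RQ2: Can World-Gymnast perform RL training from arbitrary initial frames, with novel language instructions, and at test time in novel scenes?
  • RQ3: Does iterative world model and policy improvement further narrow the sim-to-real gap?
  • RQ4: How does the method perform across multiple tasks and under distractors or language variations?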

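Key Findings

  • World-Gymnast substantially outperforms the SFT and software-simulation baselines on real-robot tasks.
  • Across four Bridge tasks, it achieves a higher real-robot success rate than SIMPLER on three of them and shows a marked improvement overall.
  • Training with distractors and novel language instructions further improves robustness and generalization (World-Gymnast-Distract, World-Gymnast-Language).
  • Test-time training on a novel frame raises the success rate on one task (Close the drawer) from 62% to 100%, but it can also risk degradation on other tasks.
  • Iterative world model and policy updates (Dyna-style) improve rollout realism and real-world performance (e.g., 95% on the Open the drawer task in AutoEval).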
Figure 2: Qualitative evaluation of policy rollouts in WorldGym with distractors. We compare rollout quality among SFT, World-Gymnast and World-Gymnast-Distract under visual distractions. The task on the left is "put blue cup on plate" and the SFT policy clearly picks up the wrong cup, while both Wor


This review was generated by AI and reviewed by a human editor.