[Paper Review] GigaWorld-Policy: An Efficient Action-Centered World–Action Model
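GigaWorld-Policy trains an action-centered World–Action model that learns future action sequences while keeping future-video prediction optional, achieving 9x faster inference than prior WAM baselines and higher real-world success rates.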
World-Action Models (WAMs) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with π0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
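The two coupled components described in the abstract can be written as a factorization. This is only a sketch in notation introduced here, not the paper's own: $o$ is the current observation, $a_{1:H}$ the future action sequence, $v_{1:K}$ the future video frames, and $\lambda$ an assumed weight between the two training losses.

```latex
p_\theta(a_{1:H}, v_{1:K} \mid o)
  = p_\theta(a_{1:H} \mid o)\; p_\theta(v_{1:K} \mid o, a_{1:H}),
\qquad
\mathcal{L} = \mathcal{L}_{\text{action}} + \lambda\, \mathcal{L}_{\text{video}}
```

Under this reading, the action factor does not depend on $v_{1:K}$, so only the first factor needs to be evaluated at inference time, while the video term contributes supervision during training.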
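Motivation and Objectives
- Address two inefficiencies of World–Action Models (WAMs) for robot policy learning: the inference overhead of jointly predicting video and actions, and the entanglement of visual and motion representations.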
- Develop an action-centered WAM that decouples action decoding from video generation to enable low-latency control.
- Leverage future visual dynamics as dense supervision during training while keeping future video generation optional at inference.
Proposed Method
- Use a single unified diffusion Transformer backbone, initialized from a large-scale pre-trained video generation model, as the World–Action model.
- Represent multi-view observations as a single composite image to enable cross-view reasoning.
- Employ a causal self-attention scheme that separates action tokens, state tokens, observation tokens, and future-video tokens, ensuring future video cannot influence action generation.
- Train with flow-matching objectives that jointly optimize action prediction and future visual dynamics modeling.
- Pre-train with embodied video data (robot and egocentric human videos) and post-train on target-robot trajectories to specialize the policy.
- At inference, perform action decoding with optional video generation, enabling low-latency, closed-loop control.
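The causal attention scheme in the list above can be illustrated with a small block mask. This is a minimal sketch under stated assumptions, not the paper's implementation: the group order, the `VISIBLE` table, and the helper names `build_mask` and `action_only_prefix` are all hypothetical. The only constraint taken from the source is that action tokens never attend to future-video tokens.

```python
# Hypothetical token-group order; the exact layout is not given in this review.
GROUPS = ("obs", "state", "action", "video")

# Assumed visibility table: which key groups each query group may attend to.
# The one rule taken from the source: "action" queries never see "video" keys.
VISIBLE = {
    "obs": {"obs", "state"},
    "state": {"obs", "state"},
    "action": {"obs", "state", "action"},
    "video": {"obs", "state", "action", "video"},
}


def build_mask(sizes):
    """Return (labels, mask); mask[q][k] is True if query q may attend to key k."""
    # One label per token, laid out group by group in GROUPS order.
    labels = [g for g in GROUPS for _ in range(sizes[g])]
    mask = [[k in VISIBLE[q] for k in labels] for q in labels]
    return labels, mask


def action_only_prefix(labels):
    """Count the tokens that remain when future-video tokens are dropped."""
    return sum(1 for g in labels if g != "video")


if __name__ == "__main__":
    labels, mask = build_mask({"obs": 2, "state": 1, "action": 2, "video": 3})
    # Print one row per query token; "x" marks an allowed key, "." a blocked one.
    for label, row in zip(labels, mask):
        print(f"{label:>6} " + "".join("x" if ok else "." for ok in row))
```

Because no observation, state, or action row has an allowed entry in a video column, dropping the video tokens at inference leaves the inputs to action decoding unchanged, which is what makes video generation optional.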
Experimental Results
Research Questions
- RQ1: Can an action-centered WAM improve policy learning efficiency by using future visual dynamics as supervision without requiring constant video rollouts at inference?
- RQ2: Does decoupling action decoding from future video generation reduce inference latency without sacrificing task performance?
- RQ3: How does embodied pre-training on robot and egocentric data affect robustness and sample efficiency for real-world robot tasks?
- RQ4: What is the impact of predicting different numbers of future frames on policy performance in real-world settings?
- RQ5: How does GigaWorld-Policy compare to state-of-the-art WAM and VLA baselines in simulation and real-world deployments?
Key Findings
- GigaWorld-Policy achieves about a 9x speedup in inference time over Motus while maintaining competitive task performance.
- In real-world experiments, GigaWorld-Policy improves average success rate by about 7% over Motus under similar speed conditions.
- Compared with a Vision–Language–Action baseline (π0.5), GigaWorld-Policy improves performance by around 95% on RoboTwin 2.0 under comparable speed.
- Pre-training with embodied data and a large video backbone yields substantial gains in real-world success, with full embodied pre-training plus video initialization outperforming ablations.
- The model reaches higher real-world success with more predicted future frames (up to a point), indicating that modeling future visual dynamics benefits action prediction.
This review was generated by AI and checked by a human editor.