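[Paper Review] Genie: Generative Interactive Environments
Genie is a foundation world model trained in an unsupervised manner on unlabelled Internet videos. The 11B-parameter model generates interactive environments that can be controlled with actions frame by frame, and it can be prompted with images, sketches, or text.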
We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.
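Motivation and Goals
- Motivate and enable the generation of interactive, playable virtual worlds from prompts, without ground-truth action labels.
- Develop a scalable, modular architecture that learns a latent action space from video data.
- Demonstrate frame-by-frame controllability and generalization to unseen prompts and domains.
- Explore the potential of latent actions learned from video to support training generalist agents (open-ended learning).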
Proposed Method
- Three main components: a video tokenizer (VQ-VAE-based) to tokenise frames into discrete tokens; a latent action model (LAM) that learns a small discrete set of latent actions in an unsupervised manner; a dynamics model (MaskGIT-based) that autoregressively predicts future frame tokens conditioned on past tokens and latent actions.
- The architecture uses spatiotemporal (ST) transformers across components to handle video data efficiently; a causal mask enables processing entire sequences for latent action inference and future frame prediction.
- Training is performed in two phases: the video tokenizer is trained first, and then the latent action model (which, per the paper, learns directly from pixels) and the dynamics model (which operates on video tokens) are co-trained.
- The latent action space is discretized with a small VQ codebook (|A|=8) to ensure controllability and human-playability.
- Experiments are conducted on Platformers video data (≈30k hours) and robotics videos (RT1), with evaluation using Frechet Video Distance (FVD) and a controllability metric Delta_t-PSNR.
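The discrete latent action space described above can be illustrated with a nearest-neighbour codebook lookup, the core step of vector quantization. This is a minimal sketch, not the paper's implementation: the codebook here is a fixed toy set of 8 vectors (the corners of a cube) rather than a learned one, and the names `TOY_CODEBOOK`, `squared_distance`, and `quantize_latent_action` are hypothetical.

```python
from itertools import product

NUM_LATENT_ACTIONS = 8  # |A| = 8, as stated in the method description

# Toy codebook (assumption): the 8 corners of a cube in 3-D.
# In a trained model these vectors would be learned, not fixed.
TOY_CODEBOOK = [list(corner) for corner in product((-1.0, 1.0), repeat=3)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize_latent_action(embedding, codebook=TOY_CODEBOOK):
    """Map a continuous embedding to the index and vector of its nearest code."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(embedding, codebook[i]),
    )
    return index, codebook[index]


# A continuous embedding is snapped to one of the 8 discrete actions.
action_index, action_vector = quantize_latent_action([0.9, -0.8, 0.7])
print(action_index, action_vector)
```

Keeping the codebook this small is what makes the action space playable: a user (or an agent) only ever chooses among 8 options per frame.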
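The controllability metric Delta_t-PSNR mentioned above can be sketched as a difference of two PSNR values. This sketch assumes the following reading of the metric: the PSNR between a ground-truth frame and the frame generated from latent actions inferred from the real video, minus the PSNR between the same ground-truth frame and a frame generated from randomly sampled latent actions. A larger value means the chosen actions matter more. Frames are represented as flat lists of pixel values, and the function names are hypothetical.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio (in dB) between two frames given as flat pixel lists."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def delta_psnr(ground_truth, from_inferred_actions, from_random_actions, max_value=255.0):
    """PSNR with inferred actions minus PSNR with random actions (assumed definition)."""
    return psnr(ground_truth, from_inferred_actions, max_value) - psnr(
        ground_truth, from_random_actions, max_value
    )


# Toy 4-pixel frames: the inferred-action frame is much closer to the ground truth.
truth = [100, 120, 140, 160]
inferred = [101, 119, 141, 159]
random_actions = [110, 110, 150, 150]
print(delta_psnr(truth, inferred, random_actions))
```

If random actions produced frames as close to the ground truth as the inferred actions did, the difference would be near zero, which would indicate that the latent actions have little effect on generation.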
Experimental Results
Research Questions
- RQ1: Can a large-scale, unsupervised model learn a usable latent action space from unlabelled videos?
- RQ2: Can Genie generate diverse, controllable interactive environments from prompts such as images or sketches?
- RQ3: Do latent actions learned from internet videos transfer to unseen prompts and to robotics domains?
- RQ4: Is the approach scalable in model size and data, and can it support potential use as a foundation model for generalist agents?
Key Findings
- Genie is trained at the 11B-parameter scale (the tokenizer and latent action model bring the total to 11.0B parameters; a larger variant is mentioned for the website) and can generate interactive environments from prompts.
- The Platformers-trained model (11B) shows strong controllability, both qualitatively and quantitatively, including on out-of-distribution image prompts (e.g., hand-drawn sketches, real photos, Imagen2 prompts).
- The Robotics-trained model (2.5B parameters) learns consistent latent actions (e.g., down, up, left) without action labels and demonstrates object interactions and handling of deformable objects.
- In the scaling experiments, increasing model size and batch size yields lower training loss, and the reported FVD and Delta_t-PSNR trends indicate improved fidelity and controllability as scale increases.
- Genie achieves a Frechet Video Distance (FVD) of 82.7 on the Robotics test set and demonstrates consistent latent-action behavior across multiple starting frames.
- The approach enables using latent actions learned from internet videos to imitate policies in unseen RL environments, with evidence that a small amount of expert data can map latent actions to real actions for policy cloning.
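The last finding, mapping latent actions to real actions with a small amount of expert data, can be sketched as a lookup table. This is an illustrative assumption rather than the paper's confirmed procedure: each latent action is mapped to the real action it co-occurs with most often in the expert data. The function names and the action labels ("LEFT", "JUMP", and so on) are hypothetical.

```python
from collections import Counter, defaultdict


def fit_latent_to_real(pairs):
    """Build a latent-to-real action table by majority vote.

    pairs: iterable of (latent_action, real_action) tuples taken from a
    small expert dataset where both are known for the same transition.
    """
    counts = defaultdict(Counter)
    for latent, real in pairs:
        counts[latent][real] += 1
    # For each latent action, keep the real action seen most often with it.
    return {latent: c.most_common(1)[0][0] for latent, c in counts.items()}


def to_real_action(latent_action, mapping, default_action):
    """Translate a policy's latent action; fall back if it was never observed."""
    return mapping.get(latent_action, default_action)


# Toy expert data: latent action 0 mostly corresponds to "LEFT".
expert_pairs = [(0, "LEFT"), (0, "LEFT"), (0, "JUMP"), (3, "RIGHT")]
mapping = fit_latent_to_real(expert_pairs)
print(mapping)
print(to_real_action(5, mapping, "NOOP"))
```

A policy trained to predict latent actions from unlabelled videos could then be run in a real environment by passing each predicted latent action through such a table.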
This review was generated by AI and reviewed by human editors.