QUICK REVIEW

[Paper Review] STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning

Weipu Zhang, Gang Wang|arXiv (Cornell University)|Oct 14, 2023

Reinforcement Learning in Robotics | Cited by 8

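One-sentence summary

STORM introduces a Transformer-based stochastic world model, paired with a VAE encoder, to improve sample efficiency and training speed on Atari 100k; without lookahead search it reaches a mean human-normalized score of 126.7%, a new record among such methods, and trains faster in wall-clock time.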

ABSTRACT

Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments. These approaches begin by constructing a parameterized simulation world model of the real environment through self-supervised learning. By leveraging the imagination of the world model, the agent's policy is enhanced without the constraints of sampling from the real environment. The performance of these algorithms heavily relies on the sequence modeling and generation capabilities of the world model. However, constructing a perfectly accurate model of a complex unknown environment is nearly impossible. Discrepancies between the model and reality may cause the agent to pursue virtual goals, resulting in subpar performance in the real environment. Introducing random noise into model-based reinforcement learning has been proven beneficial. In this work, we introduce Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines the strong sequence modeling and generation capabilities of Transformers with the stochastic nature of variational autoencoders. STORM achieves a mean human performance of $126.7\%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods that do not employ lookahead search techniques. Moreover, training an agent with $1.85$ hours of real-time interaction experience on a single NVIDIA GeForce RTX 3090 graphics card requires only $4.3$ hours, showcasing improved efficiency compared to previous methodologies.
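The "1.85 hours of real-time interaction" figure in the abstract can be reproduced with simple arithmetic. The sketch below assumes the usual Atari 100k protocol (100,000 agent steps, a frame skip of 4, 60 emulator frames per second); these three constants are assumptions and are not stated in the abstract itself.

```python
# Assumed Atari 100k protocol constants (not given in the abstract).
AGENT_STEPS = 100_000      # environment steps the agent is allowed to take
FRAME_SKIP = 4             # emulator frames consumed per agent step
FRAMES_PER_SECOND = 60     # emulator frame rate


def real_time_hours(agent_steps, frame_skip, fps):
    """Convert an agent-step budget into hours of real-time gameplay."""
    frames = agent_steps * frame_skip
    return frames / fps / 3600


hours = real_time_hours(AGENT_STEPS, FRAME_SKIP, FRAMES_PER_SECOND)
print(f"{hours:.2f} hours of gameplay")           # prints "1.85 hours of gameplay"

# Reported training time (4.3 h) relative to the gameplay it consumes.
print(f"{4.3 / hours:.1f}x wall-clock overhead")  # prints "2.3x wall-clock overhead"
```

Under these assumptions, the reported 4.3 hours of training corresponds to roughly 2.3 hours of compute per hour of collected gameplay.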

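Motivation and objectives

Improve the sample efficiency of model-based reinforcement learning in visual environments.
Develop an effective world model that combines a Transformer with stochastic latent representations.
Reduce accumulated prediction error and training time while maintaining or improving performance on Atari 100k.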

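Proposed method

Use a categorical VAE encoder to map each observation to a stochastic latent variable z_t (32 categorical variables with 32 classes each).
Merge z_t and the action a_t into a single token e_t, and feed it to a GPT-like Transformer that serves as the sequence model and produces h_t.
From h_t, MLP heads predict the reward, the continuation flag, and the next latent distribution.
Train the world model with a self-supervised loss made up of reconstruction, reward, continuation, dynamics (KL), and representation (KL) terms, with the KL terms weighted by beta coefficients.
Learn the agent's policy purely from imagined experience, using a DreamerV3-style actor-critic objective with lambda-returns and a KV cache to speed up inference.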
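Two of the steps above are easy to illustrate in isolation: drawing the 32 × 32 categorical latent z_t, and computing lambda-returns over an imagined rollout. The following is a minimal pure-Python sketch, not the authors' implementation. The function names, the indexing convention (`values` carries one extra bootstrap entry), and the default `gamma` and `lam` values are illustrative assumptions; a trainable version would also need a gradient estimator for the sampling step, which is omitted here.

```python
import math
import random


def sample_categorical_latent(logits, rng):
    """Draw one one-hot vector per row of logits (one row per categorical variable)."""
    latent = []
    for row in logits:
        peak = max(row)
        # Unnormalized softmax weights; subtracting the max keeps exp() stable.
        weights = [math.exp(x - peak) for x in row]
        index = rng.choices(range(len(row)), weights=weights)[0]
        latent.append([1.0 if i == index else 0.0 for i in range(len(row))])
    return latent


def lambda_returns(rewards, continues, values, gamma=0.985, lam=0.95):
    """Standard lambda-return recursion over a rollout of length T.

    rewards, continues: T entries each; values: T + 1 entries, where the
    last entry is the bootstrap value at the end of the rollout.
    R_t = r_t + gamma * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1}), R_T = v_T.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        blended = (1 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * continues[t] * blended
        next_return = returns[t]
    return returns


rng = random.Random(0)
logits = [[0.0] * 32 for _ in range(32)]   # hypothetical encoder output
z = sample_categorical_latent(logits, rng)
print(len(z), len(z[0]))                   # prints "32 32"

# With gamma = lam = 1 the recursion reduces to summed rewards plus bootstrap.
print(lambda_returns([1.0, 1.0, 1.0], [1.0, 1.0, 1.0],
                     [0.0, 0.0, 0.0, 5.0], gamma=1.0, lam=1.0))
# prints "[8.0, 7.0, 6.0]"
```

Setting `lam=0` collapses the recursion to a one-step bootstrapped target, while `lam=1` gives the discounted sum of imagined rewards plus the final value estimate; intermediate values trade off between the two.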

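Experimental results

Research questions

RQ1: Does a stochastic Transformer-based world model outperform RNN-based or Transformer-XL-based models on Atari 100k?
RQ2: Can a single stochastic latent representation per image effectively capture the dynamics needed for policy learning?
RQ3: How do the proposed loss design and imagination-based learning affect sample efficiency and computational efficiency?
RQ4: How do world-model design choices (encoder type, state representation, Transformer depth) affect performance?
RQ5: Can STORM achieve high performance with limited real-environment interaction?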

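Key findings

STORM achieves a mean human-normalized score of 126.7% on Atari 100k, a new record among methods that do not use lookahead search.
Training on about 1.85 hours of real-time interaction data takes about 4.3 hours on a single RTX 3090, an efficiency improvement over previous methods.
Compared with SimPLe, TWM, IRIS, and DreamerV3, STORM benefits from Transformer sequence modeling and stochastic latent representations, and performs better in games where reward-relevant objects are large.
Ablations show that using a Transformer as the sequence model, with a single stochastic latent and a merged observation-action token, is effective, while a deeper Transformer does not necessarily improve results on Atari 100k.
Adding a single demonstration trajectory can improve exploration in sparse-reward games (e.g., Pong) but may hurt performance in dense-reward games (e.g., Ms. Pacman).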
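For readers unfamiliar with the headline metric, the mean human-normalized score is conventionally computed per game as (agent - random) / (human - random) and then averaged across games. The sketch below illustrates that convention; the game names and all scores are made-up placeholders, not results from the paper.

```python
def human_normalized_score(agent, random_score, human):
    """Per-game score: 0.0 matches a random policy, 1.0 matches the human reference."""
    return (agent - random_score) / (human - random_score)


def mean_human_normalized_score(scores):
    """Average the per-game normalized scores over all games."""
    values = [human_normalized_score(a, r, h) for a, r, h in scores.values()]
    return sum(values) / len(values)


# Placeholder (agent, random, human) triples, for illustration only.
scores = {
    "game_a": (150.0, 50.0, 100.0),   # normalized score 2.0
    "game_b": (30.0, 10.0, 110.0),    # normalized score 0.2
}

mean = mean_human_normalized_score(scores)
print(f"{mean:.1%}")  # prints "110.0%"
```

Because the metric is a mean, a few games with very high normalized scores can lift the average well above 100% even when the agent is below human level on many others.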


This review was generated by AI and reviewed by a human editor.