QUICK REVIEW

[Paper Review] Observe and Look Further: Achieving Consistent Performance on Atari

Tobias Pohlen, Bilal Piot|arXiv (Cornell University)|May 29, 2018

Reinforcement Learning in Robotics | References: 18 | Cited by: 85

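One-Sentence Summary

This paper introduces Ape-X DQfD, a distributed DQN variant with a transformed Bellman operator, a temporal consistency loss, and human demonstrations, which exceeds average human performance on 40 of 42 Atari games and solves the first level of Montezuma's Revenge.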

ABSTRACT

Despite significant advances in the field of deep Reinforcement Learning (RL), today's algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games. A new transformed Bellman operator allows our algorithm to process rewards of varying densities and scales; an auxiliary temporal consistency loss allows us to train stably using a discount factor of $\gamma = 0.999$ (instead of $\gamma = 0.99$) extending the effective planning horizon by an order of magnitude; and we ease the exploration problem by using human demonstrations that guide the agent towards rewarding states. When tested on a set of 42 Atari games, our algorithm exceeds the performance of an average human on 40 games using a common set of hyper parameters. Furthermore, it is the first deep RL algorithm to solve the first level of Montezuma's Revenge.

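Research Motivation and Objectives

Identify the key challenges to achieving human-level performance across diverse Atari games: diverse reward distributions, long-horizon reasoning, and exploration.
Develop a stable learning algorithm that can handle unclipped rewards and a high discount factor without changing the optimal policy.
Leverage expert demonstrations within a distributed RL framework to improve exploration and sample efficiency.
Demonstrate better performance than prior DQN variants on a large Atari suite, including sparse-reward games.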

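Proposed Method

Introduce a transformed Bellman operator that reduces target variance without clipping rewards.
Use an auxiliary temporal consistency (TC) loss to train stably at a high discount factor of gamma = 0.999.
Combine Ape-X distributed experience replay with Deep Q-learning from Demonstrations (DQfD) to blend online actor data with expert demonstrations.
Apply the imitation loss only to the best expert trajectories, and keep a fixed ratio of actor to expert data throughout training.
Provide ablation studies that quantify the contributions of the transformed operator, the TC loss, and the demonstrations.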
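
To make the transformed Bellman operator more concrete, here is a minimal sketch in plain Python. This summary does not give the form of the transform, so the squashing function `h`, its closed-form inverse `h_inv`, the coefficient `EPS = 1e-2`, and the helper `transformed_target` are all assumptions made for illustration (a square-root-like squash plus a small linear term is one common choice). The idea it shows is that the network learns squashed values, so the target un-squashes the bootstrap value, adds the raw reward, and squashes the sum again.

```python
import math

EPS = 1e-2  # assumed coefficient of the linear term; not stated in this summary


def h(z: float, eps: float = EPS) -> float:
    """Squash a value: grows like sqrt(|z|) for large |z|, keeps the sign of z."""
    return math.copysign(math.sqrt(abs(z) + 1.0) - 1.0, z) + eps * z


def h_inv(x: float, eps: float = EPS) -> float:
    """Closed-form inverse of h, so h_inv(h(z)) equals z up to rounding."""
    inner = (math.sqrt(1.0 + 4.0 * eps * (abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(inner * inner - 1.0, x)


def transformed_target(reward: float, gamma: float, next_q: float,
                       done: bool = False) -> float:
    """One-step target in squashed space: h(r + gamma * h_inv(next_q)).

    next_q is the (already squashed) bootstrap value for the next state,
    for example the maximum over actions of the network's output.
    """
    bootstrap = 0.0 if done else gamma * h_inv(next_q)
    return h(reward + bootstrap)


# The raw reward is used unclipped; only the regression target is compressed.
small = transformed_target(reward=1.0, gamma=0.999, next_q=0.0)
large = transformed_target(reward=1000.0, gamma=0.999, next_q=0.0)
print(small, large)
```

Because `h` is strictly increasing and invertible, the ordering of action values is preserved in the squashed space, while a reward of 1000 and a reward of 1 produce targets that differ by far less than a factor of 1000.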

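Experimental Results

Research Questions

RQ1: Can the transformed Bellman operator stabilize Q-learning across diverse reward scales without clipping rewards?
RQ2: Can the temporal consistency loss enable stable learning and a longer effective planning horizon when gamma is close to 1?
RQ3: How does incorporating demonstrations into a distributed DQN framework affect performance and exploration on Atari games?
RQ4: To what extent does the proposed method improve performance on sparse-reward games such as Montezuma's Revenge and Pitfall!?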
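
As a rough illustration of what a temporal consistency loss in RQ2 could look like, the sketch below penalizes how much the prediction for the next state drifts away from a frozen copy of the estimates when the parameters are updated. This summary does not spell out the loss, so the Huber penalty, the choice of greedy action under the frozen estimates, and the names `huber` and `tc_loss` are assumptions rather than the authors' exact formulation.

```python
from typing import Sequence


def huber(x: float, delta: float = 1.0) -> float:
    """Huber penalty: quadratic near zero, linear for |x| > delta."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def tc_loss(q_next_online: Sequence[float],
            q_next_frozen: Sequence[float]) -> float:
    """Penalize drift of Q(s', a') between the online and frozen estimates.

    Both arguments hold one value per action for the same next state s'.
    The action a' is chosen greedily under the frozen estimates (assumption).
    """
    a_next = max(range(len(q_next_frozen)), key=lambda a: q_next_frozen[a])
    return huber(q_next_online[a_next] - q_next_frozen[a_next])


# No drift at the greedy next action gives zero loss; drift is penalized.
print(tc_loss([1.0, 2.0], [1.0, 2.0]))
print(tc_loss([1.0, 2.5], [1.0, 2.0]))
```

The intuition is that with gamma close to 1 the target leans heavily on the next-state value, so an update that fits the current state while shifting the next-state prediction can feed back into its own target; an auxiliary term of this kind discourages that shift.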

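Key Findings

Using a single set of hyperparameters, the algorithm exceeds average human performance on 40 of 42 Atari games.
It is the first deep RL method to complete the first level of Montezuma's Revenge.
Combining the TC loss with the higher discount factor gamma = 0.999 yields a longer planning horizon and stable learning.
Combining the transformed Bellman operator, the TC loss, and demonstrations yields more consistent results and better performance across games than the baselines.
A deeper network architecture with gamma = 0.999 further improves the results, reaching above-average-human performance on 40 of 42 games.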
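
The "order of magnitude" gain in planning horizon from gamma = 0.999 can be checked with a little arithmetic. The sketch below uses the common rule of thumb that the effective horizon is about 1 / (1 - gamma); that heuristic and the helper names `effective_horizon` and `discount_weight` are assumptions for illustration, not definitions taken from the paper.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon: the sum of the discount weights 1 + gamma + gamma^2 + ..."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma: float, steps: int) -> float:
    """Weight given to a reward that arrives `steps` time steps in the future."""
    return gamma ** steps


for gamma in (0.99, 0.999):
    print(
        f"gamma={gamma}: horizon ~ {round(effective_horizon(gamma))} steps, "
        f"weight of a reward 1000 steps away ~ {discount_weight(gamma, 1000):.5f}"
    )
```

With gamma = 0.99 the horizon is about 100 steps and a reward 1000 steps away is weighted at well under one ten-thousandth, so it is practically invisible to the agent. With gamma = 0.999 the horizon is about 1000 steps and the same reward keeps roughly 37% of its value, which matters for sparse-reward games where the payoff can be far away.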

This review was generated by AI and reviewed by a human editor.