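[Paper Explained] Go-Explore: a New Approach for Hard-Exploration Problems
Go-Explore introduces a two-phase algorithm that remembers promising states, first returns to them without exploring, and then explores from there. It finally robustifies the resulting solutions via imitation learning, reaching superhuman Atari performance on hard-exploration tasks.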
A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma's Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma's Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma's Revenge. Its max performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of "superhuman" performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).
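Motivation and Objectives
- Address the exploration problem in settings with sparse or deceptive rewards.
- Improve performance on hard-exploration Atari benchmarks while relying as little as possible on intrinsic motivation.
- Develop a two-phase framework: Phase 1 explores using an archive of promising states, and Phase 2 robustifies the result via imitation learning.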
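Proposed Method
- Store promising states (cells) together with how each was reached.
- At each iteration, select a cell from the archive, return to it deterministically, then explore from it with random actions.
- Update the archive whenever a new cell is discovered or a better trajectory to an existing cell is found; each entry records the trajectory, state, score, and trajectory length.
- Use two cell representations: a simple, domain-knowledge-free downscaled grayscale image (11x8), and a domain-knowledge-augmented representation (e.g., agent position, room, keys).
- Phase 2 robustifies the Phase 1 trajectories with imitation learning (the Backward Algorithm) combined with PPO: the agent starts training near the end of a trajectory, and the starting point is moved progressively back toward the beginning, until the agent matches or exceeds the original score.
- The method separates deterministic training (Phase 1) from stochastic evaluation: Phase 1 can exploit deterministic resets, and stochasticity is added in Phase 2 to improve robustness.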
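The Phase 1 loop described in the list above (select a cell, return to it, explore, update the archive) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyChain`, `cell_of`, and `go_explore_phase1` are hypothetical names, the environment is a toy 1-D chain rather than Atari, cells are selected uniformly at random as a simplification, and a "better trajectory" is assumed to mean a higher score or an equal score reached in fewer steps.

```python
import random


class ToyChain:
    """Hypothetical deterministic environment: a walk on positions 0..length."""

    def __init__(self, length=30):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        """Apply action (-1 or +1); reward 1.0 each time the far end is entered."""
        new_pos = max(0, min(self.length, self.pos + action))
        entered_goal = new_pos == self.length and self.pos != self.length
        self.pos = new_pos
        return self.pos, (1.0 if entered_goal else 0.0)

    # A deterministic simulator lets us save and restore states directly.
    def get_state(self):
        return self.pos

    def set_state(self, state):
        self.pos = state


def cell_of(pos, bin_size=3):
    """Coarse cell representation: nearby positions share one cell."""
    return pos // bin_size


def go_explore_phase1(env, iterations=200, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    # Archive: cell -> simulator state, action trajectory from reset, and score.
    archive = {
        cell_of(start): {"state": env.get_state(), "trajectory": [], "score": 0.0}
    }
    for _ in range(iterations):
        # Select a cell (uniformly here; an assumption made for simplicity).
        entry = archive[rng.choice(sorted(archive))]
        # Go: return to the stored state without any exploration.
        env.set_state(entry["state"])
        trajectory = list(entry["trajectory"])
        score = entry["score"]
        # Explore: take random actions from the restored state.
        for _ in range(explore_steps):
            action = rng.choice([-1, 1])
            pos, reward = env.step(action)
            trajectory.append(action)
            score += reward
            cell = cell_of(pos)
            best = archive.get(cell)
            is_better = (
                best is None
                or score > best["score"]
                or (score == best["score"]
                    and len(trajectory) < len(best["trajectory"]))
            )
            if is_better:
                archive[cell] = {
                    "state": env.get_state(),
                    "trajectory": list(trajectory),
                    "score": score,
                }
    return archive


if __name__ == "__main__":
    found = go_explore_phase1(ToyChain())
    print("cells discovered:", sorted(found))
```

Because the toy environment is deterministic, replaying any archived trajectory from `reset()` reproduces the stored state and score. That property is what makes the "return without exploring" step cheap in Phase 1, and it is also why a separate robustification phase is needed once stochasticity is introduced.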
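Experimental Results
Research Questions
- RQ1: Does explicitly building an archive of previously visited states, and returning to them before exploring, improve performance on hard-exploration tasks?
- RQ2: Does combining deterministic exploration with subsequent robustification via imitation learning yield scalable improvements under sparse or deceptive rewards?
- RQ3: To what extent does domain knowledge in the cell representation accelerate discovery and improve performance on hard-exploration benchmarks?
- RQ4: How does Go-Explore address the detachment and derailment problems that affect conventional intrinsic-motivation methods?
- RQ5: How large are the performance gains on Montezuma's Revenge and Pitfall with and without domain knowledge?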
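Key Findings
- On Montezuma's Revenge without domain knowledge, Go-Explore scores a mean of over 43,000 points, almost 4 times the previous state of the art.
- With easily provided domain knowledge, Go-Explore's mean score on Montezuma's Revenge exceeds 650,000 points, and its maximum of nearly 18 million surpasses the human world record.
- On Pitfall, Go-Explore with domain knowledge reaches a mean score of 59,494 and a maximum of 107,363, close to the maximum score achievable in the game.
- The exact mean score on Montezuma's Revenge without domain knowledge is 43,763, still far above prior work.
- Go-Explore automatically generates high-performing demonstrations suitable for imitation learning, outperforming earlier imitation-learning results that rely on human demonstrations.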
This explainer was generated by AI and reviewed by a human editor.