[Paper Walkthrough] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
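This paper extends classic count-based exploration to high-dimensional state spaces by hashing states (with both static and learned hash functions), and reports near state-of-the-art results on continuous control and Atari benchmarks.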
Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.
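Motivation and Goals
- Motivate the need for robust exploration in deep reinforcement learning over high-dimensional state spaces.
- Generalize count-based exploration to continuous and complex state spaces by hashing states.
- Evaluate simple hash-based exploration on challenging deep RL benchmarks.
- Analyze what makes hash-based exploration effective (granularity of the hash, and whether it encodes information relevant to the task).
- Provide a fast, flexible baseline that is compatible with common deep RL algorithms.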
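Proposed Method
- Discretize the state space with a hash function φ, maintain a count n(φ(s)) for each hash code, and add a bonus of β / sqrt(n(φ(s))) to the reward.
- Use SimHash as a practical, scalable hash function for continuous states (Algorithm 1).
- Explore learned hashing by training an autoencoder to produce binary codes, then applying SimHash to those codes to obtain φ(s) (Algorithm 2).
- Optionally preprocess states with domain knowledge (for example, static features such as BASS), or use a learned representation, to improve hash quality.
- Evaluate with TRPO on the rllab continuous control benchmarks and on Atari 2600 games, comparing against baselines and prior exploration methods.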
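The counting and bonus steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the class name `SimHashCounter`, its method names, and the default values of `k` and `beta` are hypothetical. It assumes SimHash takes the sign of a fixed Gaussian random projection of the (already preprocessed) state, and it clamps unseen codes to a count of 1 so the bonus stays finite, which is a choice made for this sketch.

```python
import numpy as np


class SimHashCounter:
    """Hypothetical helper: SimHash state counts plus a beta / sqrt(n) bonus."""

    def __init__(self, state_dim, k=16, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed k x state_dim projection with i.i.d. standard normal entries.
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta
        self.counts = {}  # hash code (bytes) -> visit count

    def hash_codes(self, states):
        """Map a batch of states, shape (N, state_dim), to k-bit codes."""
        states = np.atleast_2d(np.asarray(states, dtype=np.float64))
        bits = (states @ self.A.T) >= 0  # sign of each projection
        # Pack each row of k bits into bytes so it can serve as a dict key.
        return [row.tobytes() for row in np.packbits(bits, axis=1)]

    def update(self, states):
        """Increment the count of every visited state's hash code."""
        for code in self.hash_codes(states):
            self.counts[code] = self.counts.get(code, 0) + 1

    def bonus(self, states):
        """Return beta / sqrt(n(phi(s))) for each state in the batch."""
        n = np.array(
            [self.counts.get(code, 0) for code in self.hash_codes(states)],
            dtype=np.float64,
        )
        # Assumption of this sketch: treat unseen codes as n = 1.
        return self.beta / np.sqrt(np.maximum(n, 1.0))


# Usage: count a batch of visited states, then compute their bonuses.
counter = SimHashCounter(state_dim=4, k=8, beta=1.0)
batch = np.array([[1.0, 2.0, 3.0, 4.0]] * 4)
counter.update(batch)
bonuses = counter.bonus(batch)
```

In a training loop, the bonus would be added to the environment reward of each transition before the policy update. The code length `k` sets the granularity: a larger `k` separates more states into distinct buckets, while a smaller `k` merges more of them into the same count.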
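Experimental Results
Research Questions
- RQ1: Does count-based exploration via hashing improve performance on continuous control and Atari benchmarks?
- RQ2: How do static and learned hash codes affect exploration performance on image-based observations?
- RQ3: Which properties of the hash function (granularity, encoded information) contribute most to effective exploration?
- RQ4: How does hash-based exploration compare with state-of-the-art exploration methods for deep RL?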
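Key Findings
- Hash-based exploration achieves near state-of-the-art performance on multiple benchmarks.
- Both static SimHash and learned hashing (AE-SimHash) can outperform the TRPO baseline on several Atari games and continuous control tasks.
- Domain-specific preprocessing (BASS) or learned hash codes can bring significant improvements, especially on games such as Montezuma's Revenge and Venture.
- Simple hashing can provide a strong exploration signal without complex intrinsic-motivation machinery.
- The method is fast and flexible, and it complements existing deep RL algorithms.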
This walkthrough was generated by AI and reviewed by human editors.