QUICK REVIEW

[Paper Review] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft|arXiv (Cornell University)|Nov 15, 2016
Reinforcement Learning in Robotics|References: 35|Cited by: 343
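One-Sentence Summary

This paper extends classic count-based exploration to high-dimensional state spaces by hashing states (with either static or learned hash functions), and reports near state-of-the-art results on continuous control and Atari benchmarks.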

ABSTRACT

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.

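Motivation and Goals

  • Motivate robust exploration for deep reinforcement learning in high-dimensional state spaces.
  • Generalize count-based exploration to continuous and complex state spaces via hashing.
  • Evaluate simple hash-based exploration on challenging deep reinforcement learning benchmarks.
  • Analyze what makes hash-based exploration effective (granularity, relevant information).
  • Provide a fast, flexible baseline that is compatible with common deep RL algorithms.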

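Proposed Method

  • Discretize the state space with a hash function φ, keep a count n(φ(s)) for each hash code, and add a bonus of β / sqrt(n(φ(s))) to the reward.
  • Use SimHash as a practical, scalable hash for continuous states (Algorithm 1).
  • Explore learned hashing: train an autoencoder to produce binary codes, then apply SimHash to those codes to obtain φ(s) (Algorithm 2).
  • Optionally preprocess states with domain knowledge (static features such as BASS) or use learned representations to improve hash quality.
  • Evaluate with TRPO on the rllab continuous control benchmarks and on Atari 2600 games, comparing against baselines and prior exploration methods.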
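The counting step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes SimHash in its common random-projection form (the sign of a Gaussian projection of the state), applies no preprocessing to the state, and uses the hypothetical names `SimHashCounter`, `k`, `beta`, and `seed`, none of which come from the paper. The default values are placeholders.

```python
from collections import defaultdict

import numpy as np


class SimHashCounter:
    """Illustrative count-based exploration bonus using SimHash codes."""

    def __init__(self, state_dim, k=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection with i.i.d. standard Gaussian entries.
        # k is the code length: larger k means finer granularity.
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta
        self.counts = defaultdict(int)

    def hash_code(self, state):
        # phi(s): sign pattern of the projected state, packed into bytes
        # so it can serve as a dictionary key.
        bits = (self.A @ np.asarray(state, dtype=float)) >= 0
        return bits.astype(np.uint8).tobytes()

    def update(self, states):
        # Increment n(phi(s)) for every visited state.
        for s in states:
            self.counts[self.hash_code(s)] += 1

    def bonus(self, state):
        # beta / sqrt(n(phi(s))). Unseen codes are treated as n = 1,
        # an assumption made here to avoid division by zero.
        n = self.counts.get(self.hash_code(state), 0)
        return self.beta / np.sqrt(max(n, 1))


# Usage: count a batch of visited states, then augment the rewards.
counter = SimHashCounter(state_dim=4, k=16, beta=0.01)
batch = np.random.default_rng(1).standard_normal((5, 4))
counter.update(batch)
augmented = [1.0 + counter.bonus(s) for s in batch]  # extrinsic reward 1.0
```

The code length `k` is the granularity knob in this sketch: a short code merges many distinct states into one bucket, while a long code can leave almost every state with a count of one.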

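Experimental Results

Research Questions

  • RQ1: Does count-based exploration via hashing improve performance on continuous control and Atari benchmarks?
  • RQ2: How do static and learned hash codes affect exploration performance on image-based observations?
  • RQ3: Which hash function properties (granularity, encoded information) contribute most to effective exploration?
  • RQ4: How does hash-based exploration compare with state-of-the-art deep RL exploration methods?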

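Key Findings

  • Hash-based exploration achieves near state-of-the-art performance on multiple benchmarks.
  • Static SimHash and learned hashing (AE-SimHash) can outperform baseline TRPO on several Atari games and continuous control tasks.
  • Domain-dependent preprocessing (BASS) or learned hash codes can yield significant improvements, especially on games such as Montezuma's Revenge and Venture.
  • Simple hashing provides a strong exploration signal without complex intrinsic-motivation machinery.
  • The method is fast, flexible, and complementary to existing deep RL algorithms.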

This review was generated by AI and reviewed by a human editor.