QUICK REVIEW

[Paper Review] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft|arXiv (Cornell University)|Nov 15, 2016

Reinforcement Learning in Robotics | References: 35 | Cited by: 343

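One-Sentence Summary

This paper extends classic count-based exploration to high-dimensional state spaces via hashing (both static and learned hash functions) and reports near state-of-the-art results on continuous control and Atari benchmarks.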

ABSTRACT

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.

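Research Motivation and Goals

Motivate the need for robust exploration in deep reinforcement learning over high-dimensional state spaces.
Generalize count-based exploration to continuous and complex state spaces by hashing states.
Evaluate simple hash-based exploration on challenging deep RL benchmarks.
Analyze what makes hash-based exploration effective (granularity, and encoding task-relevant information).
Provide a fast, flexible baseline that is compatible with common deep RL algorithms.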

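Proposed Method

Discretize the state space with a hash function φ, maintain visit counts n(φ(s)), and add a bonus β / sqrt(n(φ(s))) to the reward.
Use SimHash as a practical, scalable hash for continuous states (Algorithm 1).
Explore learned hashing: train an autoencoder to produce binary codes, then apply SimHash to those codes to obtain φ(s) (Algorithm 2).
Optionally preprocess states with domain knowledge (e.g., static features such as BASS) or use a learned representation to improve hash quality.
Evaluate with TRPO on the rllab continuous control benchmarks and Atari 2600 games, comparing against baselines and prior exploration methods.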
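
The counting scheme described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `SimHashCounter`, its methods, and the default values for `k` and `beta` are assumptions made for the example, and the state is assumed to be a flat real-valued vector. A random Gaussian matrix projects the state, the signs of the projection form a k-bit code, and a dictionary keyed by that code holds the visit counts.

```python
from collections import defaultdict

import numpy as np


class SimHashCounter:
    """Hypothetical helper: hash states with SimHash and count visits."""

    def __init__(self, state_dim, k=16, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Projection matrix with i.i.d. standard Gaussian entries, shape (k, state_dim).
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta  # bonus coefficient (assumed default)
        self.counts = defaultdict(int)  # hash code -> visit count

    def code(self, state):
        # phi(s): the sign pattern of A @ s, stored as k bytes of 0/1.
        bits = (self.A @ np.asarray(state, dtype=float)) >= 0
        return bits.astype(np.uint8).tobytes()

    def visit(self, state):
        # Increment the count for this state's bucket, then return
        # the bonus beta / sqrt(n(phi(s))) using the updated count.
        key = self.code(state)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])


counter = SimHashCounter(state_dim=4, k=8, beta=1.0)
state = np.array([0.1, -0.2, 0.3, 0.4])
first = counter.visit(state)         # first visit to this bucket
second = counter.visit(state)        # same bucket, so the bonus shrinks
scaled = counter.visit(2.0 * state)  # same sign pattern, same bucket
print(first, second, scaled)
```

In this sketch, `k` controls granularity: more bits give finer buckets and fewer collisions between distinct states. For the learned variant, one could pass the autoencoder's binary code to `visit` in place of the raw state; that wiring is likewise an assumption of the example.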

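Experimental Results

Research Questions

RQ1: Does count-based exploration via hashing improve performance on continuous control and Atari benchmarks?
RQ2: How do static versus learned hash codes affect exploration performance on image-based observations?
RQ3: Which properties of the hash function (granularity, encoded information) contribute most to effective exploration?
RQ4: How does hash-based exploration compare with state-of-the-art exploration methods for deep reinforcement learning?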

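Key Findings

Hash-based exploration achieves near state-of-the-art performance on several benchmarks.
Both static SimHash and learned hashing (AE-SimHash) can outperform the TRPO baseline on multiple Atari games and continuous control tasks.
Domain-specific preprocessing (BASS) or learned hash codes can bring substantial improvements, especially on games such as Montezuma's Revenge and Venture.
Simple hashing provides a strong exploration signal without complex intrinsic-motivation machinery.
The method is fast, flexible, and complementary to existing deep RL algorithms.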


This review was generated by AI and reviewed by a human editor.