QUICK REVIEW

[Paper Review] Unifying Count-Based Exploration and Intrinsic Motivation

Marc G. Bellemare, Sriram Srinivasan|arXiv (Cornell University)|Jun 6, 2016

Reinforcement Learning in Robotics | References: 50 | Cited by: 246

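One-Sentence Summary

This paper derives pseudo-counts from density models, generalizing count-based exploration to non-tabular settings and connecting it to information gain, and demonstrates improved exploration on Atari 2600 games, including Montezuma's Revenge.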

ABSTRACT

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into intrinsic rewards and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma's Revenge.

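Research Motivation and Goals

Motivate the exploration problem in non-tabular reinforcement learning and identify the limitations of traditional count-based methods.
Propose a density-model-based mechanism for deriving pseudo-counts so that counts generalize across states.
Establish theoretical connections between pseudo-counts, prediction gain, and information gain.
Demonstrate the practical effectiveness of pseudo-count bonuses on Atari 2600 games (including Montezuma's Revenge), in both experience-replay and actor-critic settings.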

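Proposed Method

Define a pseudo-count from a density model by relating two quantities: the model's current probability of a state, rho_n(x), and its recoding probability, rho'_n(x), the probability it would assign to x after observing x one more time.
Derive the pseudo-count N_hat_n(x) = rho_n(x) (1 - rho'_n(x)) / (rho'_n(x) - rho_n(x)) from these two probabilities, generalizing the empirical count N_n(x) to non-tabular spaces.
Relate pseudo-counts to information gain (IG) and prediction gain (PG), showing that IG_n(x) ≤ PG_n(x) ≤ N_hat_n(x)^{-1} and PG_n(x) ≤ N_hat_n(x)^{-1/2}.
Apply the pseudo-count exploration bonus R^+_n(x,a) = β (N_hat_n(x) + 0.01)^{-1/2}, in the style of MBIE-EB, within the DQN and A3C frameworks.
Verify the properties of pseudo-counts on a simple Atari example (Freeway), then extend the experiments to further Atari 2600 games, modeling pixels with the CTS density model.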
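
The method steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `SmoothedCountModel` class is a hypothetical toy density model (additive-smoothed frequencies over a small alphabet) standing in for the CTS pixel model, and the default `beta=0.05` is a placeholder value chosen for the example, since this review does not state β. The pseudo-count and bonus formulas are the ones listed under Proposed Method.

```python
import math
from collections import Counter


class SmoothedCountModel:
    """Toy density model (hypothetical stand-in for CTS): additive-smoothed
    frequencies over a finite alphabet of `num_symbols` symbols."""

    def __init__(self, num_symbols, alpha=0.5):
        self.num_symbols = num_symbols
        self.alpha = alpha
        self.counts = Counter()
        self.total = 0

    def prob(self, x):
        # rho_n(x): probability of x given the n observations seen so far.
        return (self.counts[x] + self.alpha) / (
            self.total + self.alpha * self.num_symbols)

    def recoding_prob(self, x):
        # rho'_n(x): probability of x after one additional observation of x.
        return (self.counts[x] + 1 + self.alpha) / (
            self.total + 1 + self.alpha * self.num_symbols)

    def update(self, x):
        self.counts[x] += 1
        self.total += 1


def pseudo_count(rho, rho_prime):
    # N_hat_n(x) = rho (1 - rho') / (rho' - rho); needs rho' > rho,
    # i.e. the model must assign x more mass after seeing it again.
    if rho_prime <= rho:
        raise ValueError("recoding probability must exceed current probability")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def exploration_bonus(n_hat, beta=0.05):
    # R^+ = beta * (N_hat + 0.01)^(-1/2); beta here is an assumed placeholder.
    return beta / math.sqrt(n_hat + 0.01)


model = SmoothedCountModel(num_symbols=4)
for obs in ["a", "a", "b", "a"]:
    model.update(obs)

n_hat_a = pseudo_count(model.prob("a"), model.recoding_prob("a"))  # seen 3 times
n_hat_c = pseudo_count(model.prob("c"), model.recoding_prob("c"))  # never seen
print(n_hat_a, n_hat_c)
print(exploration_bonus(n_hat_a), exploration_bonus(n_hat_c))
```

For this particular toy model the pseudo-count works out to the true visit count plus the smoothing constant `alpha`, which makes it a convenient sanity check: the never-visited symbol gets the smaller pseudo-count and therefore the larger bonus. A pixel-level model such as CTS would not give an exact count, but the same two probabilities are all the formula needs.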

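Experimental Results

Research Questions

RQ1: Can pseudo-counts derived from a density model generalize visit counts to non-tabular state spaces?
RQ2: How do pseudo-counts relate to information gain and prediction gain, and can they provide theoretical guarantees for exploration?
RQ3: Do pseudo-count bonuses improve exploration on hard Atari games, including Montezuma's Revenge, for both value-based and policy-based RL methods?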

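Key Findings

Pseudo-counts provide a meaningful notion of state novelty that generalizes across states in non-tabular environments.
Prediction gain approximates information gain and, through its relationship with pseudo-counts, yields bounds on the exploration bonus.
Pseudo-count bonuses significantly improve exploration over the baselines on hard Atari games, most notably Montezuma's Revenge.
Combining pseudo-count bonuses with A3C (A3C+) yields higher median performance across 60 Atari games than A3C alone.
With CTS-based pseudo-counts, the agent explores faster within a given frame budget and achieves higher scores on Montezuma's Revenge.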
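
The link between prediction gain and pseudo-counts mentioned in the findings can be checked numerically. The sketch below assumes the standard definition PG_n(x) = log rho'_n(x) - log rho_n(x) and uses made-up probabilities (`rho`, `rho_prime`) purely for illustration. It relies on the algebraic identity N_hat_n(x) = (1 - rho'_n(x)) / (e^{PG_n(x)} - 1), which follows directly from the pseudo-count formula, so 1 / (e^{PG} - 1) is a close approximation whenever rho'_n(x) is small.

```python
import math


def prediction_gain(rho, rho_prime):
    # PG_n(x) = log rho'_n(x) - log rho_n(x)
    return math.log(rho_prime) - math.log(rho)


def pseudo_count(rho, rho_prime):
    # N_hat_n(x) = rho (1 - rho') / (rho' - rho)
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


# Hypothetical probabilities for a single state x (illustrative values only).
rho, rho_prime = 0.010, 0.012

pg = prediction_gain(rho, rho_prime)
n_hat = pseudo_count(rho, rho_prime)
approx_n_hat = 1.0 / (math.exp(pg) - 1.0)  # drops the (1 - rho') factor

print("PG:", pg)
print("pseudo-count:", n_hat, "approximation from PG:", approx_n_hat)
print("PG <= 1/N_hat:", pg <= 1.0 / n_hat)
print("PG <= N_hat^(-1/2):", pg <= n_hat ** -0.5)
```

In practice this means an agent that can only measure prediction gain, for example from the change in a model's log-probability after an update, still has a route to an approximate pseudo-count and hence to the same style of bonus.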

This review was generated by AI and checked by human editors.