QUICK REVIEW

[Paper Digest] Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Joshua Achiam, S. Shankar Sastry | arXiv (Cornell University) | Mar 6, 2017

Reinforcement Learning in Robotics | References: 14 | Cited by: 100

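One-Sentence Summary

This paper proposes surprisal-based intrinsic rewards that drive exploration in deep reinforcement learning through a learned transition model. Using surprisal and k-step learning progress as incentives, it demonstrates improved exploration on continuous control and Atari RAM tasks.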

ABSTRACT

Exploration in complex domains is a key challenge in reinforcement learning, especially for tasks with very sparse rewards. Recent successes in deep reinforcement learning have been achieved mostly using simple heuristic exploration strategies such as $ε$-greedy action selection or Gaussian control noise, but there are many tasks where these methods are insufficient to make any learning progress. Here, we consider more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent's surprise about its experiences via intrinsic motivation. We propose to learn a model of the MDP transition probabilities concurrently with the policy, and to form intrinsic rewards that approximate the KL-divergence of the true transition probabilities from the learned model. One of our approximations results in using surprisal as intrinsic motivation, while the other gives the $k$-step learning progress. We show that our incentives enable agents to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards, including continuous control tasks and games in the Atari RAM domain, outperforming several other heuristic exploration techniques.

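Motivation and Objectives

Promote exploration in deep reinforcement learning when rewards are sparse.
Develop scalable intrinsic rewards based on the mismatch between the true transition dynamics and the learned transition dynamics.
Learn the transition model concurrently with the policy to guide exploration.
Compare the surprisal and k-step learning progress incentives against existing exploration methods, including VIME.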

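Proposed Method

Formulate the intrinsic reward as the KL divergence between the true dynamics P and the learned model P_phi, and derive two scalable approximations of it.
Surprisal: the intrinsic reward is proportional to -log P_phi(s'|s,a).
k-step learning progress: the intrinsic reward is based on log P_phi_t(s'|s,a) - log P_phi_{t-k}(s'|s,a).
Update the dynamics model P_phi jointly with the policy, using a quasi-supervised loss with regularization and a KL-divergence constraint (Eq. 11).
Update the policy to maximize the environment return plus eta times the expected KL divergence between the true and learned dynamics (Eq. 2).
Adjust eta so that the intrinsic reward stays bounded, and normalize rewards for stability.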
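The two incentives above reduce to simple arithmetic on model log-likelihoods, which the following minimal sketch tries to make concrete. It assumes the log-probabilities log P_phi(s'|s,a) have already been computed for a batch of transitions. All function names are hypothetical, and the scaling rule in `scaled_eta` is an assumption chosen for illustration: the summary only states that eta is adjusted to keep the bonus bounded.

```python
import numpy as np

def surprisal_bonus(logp_current):
    """Surprisal bonus per transition: -log P_phi(s'|s,a)."""
    return -np.asarray(logp_current, dtype=float)

def learning_progress_bonus(logp_current, logp_old):
    """k-step learning progress: log P_phi_t(s'|s,a) - log P_phi_{t-k}(s'|s,a).

    `logp_old` holds log-probabilities from a model snapshot taken k updates ago.
    """
    return np.asarray(logp_current, dtype=float) - np.asarray(logp_old, dtype=float)

def scaled_eta(bonus, eta0=1e-3):
    """Assumed scaling rule: shrink eta0 when the mean bonus magnitude exceeds 1.

    This keeps the mean scaled bonus at or below eta0 in magnitude.
    """
    return eta0 / max(1.0, abs(float(np.mean(bonus))))

def augmented_rewards(env_rewards, bonus, eta0=1e-3):
    """Environment reward plus the eta-scaled intrinsic bonus."""
    bonus = np.asarray(bonus, dtype=float)
    return np.asarray(env_rewards, dtype=float) + scaled_eta(bonus, eta0) * bonus

# Toy batch: log-probs under the current model and under an older snapshot.
logp_now = [-2.0, -4.0]
logp_old = [-3.0, -6.0]
sparse_rewards = [0.0, 0.0]
r_surprisal = augmented_rewards(sparse_rewards, surprisal_bonus(logp_now), eta0=0.1)
r_progress = augmented_rewards(
    sparse_rewards, learning_progress_bonus(logp_now, logp_old), eta0=0.1
)
```

In this toy batch the environment reward is zero everywhere, so the policy update would be driven entirely by the bonus, which is the sparse-reward situation the method targets.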

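Experimental Results

Research Questions

RQ1: Do surprisal and learning progress, used as intrinsic rewards, improve exploration in high-dimensional, sparse-reward deep reinforcement learning environments?
RQ2: How do these intrinsic incentives compare with existing methods, such as VIME and L2 model prediction error, in the continuous control and Atari RAM domains?
RQ3: Can a single forward dynamics model provide scalable, robust intrinsic motivation across diverse tasks with both deterministic and stochastic dynamics?
RQ4: How does k in the learning progress bonus affect performance across tasks?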

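Key Findings

The surprisal incentive yields robust, improved exploration across a wide range of tasks, including continuous control and the Atari RAM domain.
k-step learning progress helps on some tasks, but can fall short of surprisal in certain environments and for certain values of k.
Surprisal generally outperforms L2 model prediction error and is competitive with VIME at lower computational cost.
The method uses a fully factored Gaussian dynamics model and needs only forward passes through it, which makes it faster than VIME.
On harder tasks such as SwimmerGather and Venture-RAM, surprisal generally outperforms the other intrinsic motivation baselines.
Surprisal remains effective even where simple exploration fails, indicating that it drives meaningful exploration in sparse-reward environments.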
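To illustrate why a fully factored Gaussian model keeps the bonus cheap, here is a minimal sketch of its log-likelihood. It assumes the model outputs a per-dimension mean and log standard deviation for the next state. The function name and the array inputs are hypothetical stand-ins for network outputs, not the authors' implementation.

```python
import numpy as np

def factored_gaussian_logp(next_state, mean, log_std):
    """log P_phi(s'|s,a) under a diagonal-covariance Gaussian.

    Each state dimension is treated as independent, so the joint
    log-density is the sum of one-dimensional Gaussian log-densities.
    Inputs have shape (..., state_dim); the result has shape (...).
    """
    next_state = np.asarray(next_state, dtype=float)
    mean = np.asarray(mean, dtype=float)
    log_std = np.asarray(log_std, dtype=float)
    z = (next_state - mean) / np.exp(log_std)
    per_dim = -0.5 * z ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    return per_dim.sum(axis=-1)

# Batch of two 2-D transitions; the second prediction misses by one unit per dimension.
s_next = np.array([[0.0, 0.0], [1.0, 1.0]])
pred_mean = np.zeros((2, 2))
pred_log_std = np.zeros((2, 2))  # unit standard deviation
logp = factored_gaussian_logp(s_next, pred_mean, pred_log_std)
surprisal = -logp  # larger for the poorly predicted transition
```

Under these assumptions, evaluating the bonus costs one model evaluation plus an elementwise sum per transition, which is consistent with the reported speed advantage over VIME.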


This digest was generated by AI and reviewed by human editors.