QUICK REVIEW

[Paper Digest] Episodic Curiosity through Reachability

Nikolay Savinov, Anton Raichuk|arXiv (Cornell University)|Oct 4, 2018

Reinforcement Learning in Robotics | References: 29 | Cited by: 162

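One-Sentence Summary

This paper proposes a reachability predictor backed by episodic memory that supplies a dense exploration reward, improving performance in sparse-reward reinforcement learning in 3D environments. It outperforms ICM on ViZDoom and DMLab navigation tasks, and in MuJoCo an ant learns locomotion from first-person-view curiosity alone.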

ABSTRACT

Rewards are sparse in the real world and most of today's reinforcement learning algorithms struggle with such sparsity. One solution to this problem is to allow the agent to create rewards for itself - thus making rewards dense and more suitable for learning. In particular, inspired by curious behaviour in animals, observing something novel could be rewarded with a bonus. Such bonus is summed up with the real task reward - making it possible for RL algorithms to learn from the combined reward. We propose a new curiosity method which uses episodic memory to form the novelty bonus. To determine the bonus, the current observation is compared with the observations in memory. Crucially, the comparison is done based on how many environment steps it takes to reach the current observation from those in memory - which incorporates rich information about environment dynamics. This allows us to overcome the known "couch-potato" issues of prior work - when the agent finds a way to instantly gratify itself by exploiting actions which lead to hardly predictable consequences. We test our approach in visually rich 3D environments in ViZDoom, DMLab and MuJoCo. In navigational tasks from ViZDoom and DMLab, our agent outperforms the state-of-the-art curiosity method ICM. In MuJoCo, an ant equipped with our curiosity module learns locomotion out of the first-person-view curiosity only.

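Motivation and Objectives

Address sparse-reward reinforcement learning by introducing an episodic curiosity module that generates a dense exploration bonus.
Use episodic memory to assess novelty by comparing the current observation with past observations in terms of reachability (number of environment steps).
Train a reachability network, made up of an embedding network and a comparator, to quantify novelty.
Demonstrate robustness to couch-potato behavior and improved exploration on ViZDoom, DMLab, and MuJoCo tasks.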

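Proposed Method

Form a reachability network from a siamese embedding network E and a comparator C: R(o_i, o_j) = C(E(o_i), E(o_j)).
Maintain an episodic memory M of past embeddings within an episode; store the current observation's embedding when its novelty bonus b exceeds a threshold.
Compute the novelty bonus b from memory reachability via a function B(M, e) that depends on the estimated distance from the current embedding e to the items in memory; augment the task reward r with b.
Train the R-network on observation pairs drawn from sequences: pairs at most k steps apart are positives, pairs farther apart are negatives, with a logistic regression loss.
Integrate with PPO by adding the bonus to the task reward; the R-network can be trained either online during policy learning or offline.
Compare against the PPO baseline, PPO + ICM, and a Grid Oracle in ViZDoom, DMLab, and MuJoCo settings.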
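The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the comparator is a hypothetical distance-based stand-in for a trained network, and the bonus form alpha * (beta - similarity), the percentile aggregation, the constants, the handling of an empty memory, and the gap multiplier for negative pairs are all assumptions made for the example.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Embedding = Tuple[float, ...]
Comparator = Callable[[Sequence[float], Sequence[float]], float]


def label_pair(i: int, j: int, k: int, gap: int = 5) -> Optional[int]:
    """Label two time steps of one trajectory for R-network training.

    Returns 1 (reachable) when the steps are at most k apart, 0 when they are
    more than gap * k apart, and None for the ambiguous band in between.
    The gap multiplier is an assumption, not taken from the summary.
    """
    distance = abs(i - j)
    if distance <= k:
        return 1
    if distance > gap * k:
        return 0
    return None


def toy_comparator(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for C(E(o_i), E(o_j)): a score in (0, 1] that
    decays with Euclidean distance. A trained network would replace this."""
    return math.exp(-math.dist(a, b))


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


class EpisodicMemory:
    """Per-episode store of embeddings with a reachability-based bonus."""

    def __init__(self, comparator: Comparator, alpha: float = 1.0,
                 beta: float = 0.5, threshold: float = 0.0, q: float = 90.0):
        self.comparator = comparator
        self.alpha = alpha          # bonus scale (assumed)
        self.beta = beta            # offset so bonuses can go negative (assumed)
        self.threshold = threshold  # novelty threshold for storing (assumed)
        self.q = q                  # aggregation percentile (assumed)
        self.items: List[Embedding] = []

    def reset(self) -> None:
        """Clear the memory at the start of each episode."""
        self.items.clear()

    def bonus(self, e: Sequence[float]) -> float:
        """B(M, e): high when e looks unreachable from the items in memory."""
        if not self.items:
            # Assumption: an empty memory counts as maximally novel.
            return self.alpha * self.beta
        scores = [self.comparator(m, e) for m in self.items]
        return self.alpha * (self.beta - percentile(scores, self.q))

    def observe(self, e: Sequence[float]) -> float:
        """Return the bonus for e; store e if the bonus exceeds the threshold."""
        b = self.bonus(e)
        if b > self.threshold:
            self.items.append(tuple(e))
        return b


def augmented_reward(task_reward: float, bonus: float) -> float:
    """Reward handed to the policy optimizer (e.g. PPO): r + b."""
    return task_reward + bonus


memory = EpisodicMemory(toy_comparator)
first = memory.observe((0.0, 0.0))    # stored: the memory was empty
repeat = memory.observe((0.0, 0.0))   # negative bonus, not stored
far = memory.observe((10.0, 0.0))     # positive bonus, stored
```

Revisiting an embedding already in memory yields a negative bonus here, which is one way such a scheme can discourage an agent from lingering in place; an embedding far from everything stored earns a positive bonus and is added to memory.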

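Experimental Results

Research Questions

RQ1: Does reachability-based episodic memory provide a robust curiosity signal that avoids the couch-potato behavior observed in prediction-error-based methods?
RQ2: Compared with state-of-the-art baselines, does episodic curiosity improve learning efficiency and final performance in sparse-reward 3D environments?
RQ3: How does the method perform on procedurally generated, highly variable levels and in no-reward exploration scenarios?
RQ4: Is the curiosity signal compatible with dense-reward tasks without hurting performance?
RQ5: Can the method extend to first-person-view curiosity in continuous control (MuJoCo)?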

Key Findings

Method	Sparse	Very Sparse	Sparse + Doors	No Reward	No Reward - Fire	Dense 1	Dense 2
PPO	27.0 ± 5.1	8.6 ± 4.3	1.5 ± 0.1	191 ± 12	217 ± 19	22.8 ± 0.5	9.41 ± 0.02
PPO + ICM	23.8 ± 2.8	11.2 ± 3.9	2.7 ± 0.2	72 ± 2	87 ± 3	20.9 ± 0.6	9.39 ± 0.02
PPO + EC (ours)	26.2 ± 1.9	24.7 ± 2.2	8.5 ± 0.6	475 ± 8	492 ± 10	19.9 ± 0.7	9.53 ± 0.03
PPO + ECO (ours)	41.6 ± 1.7	40.5 ± 1.1	19.8 ± 0.5	472 ± 18	457 ± 32	22.9 ± 0.4	9.60 ± 0.02
PPO + Grid Oracle	56.7 ± 1.3	54.3 ± 1.2	29.4 ± 0.5	796 ± 2	795 ± 3	20.9 ± 0.6	8.97 ± 0.04

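EC outperforms the ICM baseline on ViZDoom and DMLab navigation tasks.
On procedurally generated DMLab levels with sparse goals, EC scores at least 2x higher than ICM in the Very Sparse and Sparse + Doors columns of the table above (24.7 vs. 11.2 and 8.5 vs. 2.7).
Under no-reward exploration, EC covers far more area than ICM (475 vs. 72), and ICM also struggles in the No Reward - Fire variant (87 vs. 492 for EC).
In dense-reward DMLab tasks, adding the curiosity bonus does not substantially hurt PPO: ECO matches PPO on Dense 1 (22.9 vs. 22.8) and slightly exceeds it on Dense 2 (9.60 vs. 9.41), while EC is somewhat lower on Dense 1 (19.9).
The MuJoCo ant learns locomotion from the first-person view using the EC reward signal alone.
Across the sparse-reward and no-reward benchmarks, EC converges faster and explores more robustly than the prior curiosity method ICM.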
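The comparison with ICM can be checked directly against the mean scores in the table above. The short script below is only a sanity check on those means: it ignores the ± spreads, and the dictionary layout, function name, and English task labels are chosen for illustration.

```python
# Mean scores copied from the table above (the ± spreads are ignored here).
MEAN_SCORES = {
    "PPO + ICM": {"Sparse": 23.8, "Very Sparse": 11.2, "Sparse + Doors": 2.7},
    "PPO + EC": {"Sparse": 26.2, "Very Sparse": 24.7, "Sparse + Doors": 8.5},
    "PPO + ECO": {"Sparse": 41.6, "Very Sparse": 40.5, "Sparse + Doors": 19.8},
}


def ratio_to_icm(method: str) -> dict:
    """Ratio of a method's mean score to the PPO + ICM mean, per task."""
    baseline = MEAN_SCORES["PPO + ICM"]
    return {task: MEAN_SCORES[method][task] / baseline[task] for task in baseline}


for method in ("PPO + EC", "PPO + ECO"):
    for task, ratio in ratio_to_icm(method).items():
        print(f"{method:10s} {task:15s} {ratio:.2f}x")
```

On these means, both variants exceed a 2x ratio over ICM in the Very Sparse and Sparse + Doors columns, but not in the Sparse column.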


This digest was generated by AI and reviewed by a human editor.