QUICK REVIEW

[Paper Review] Large-Scale Study of Curiosity-Driven Learning

Yuri Burda, Harri Edwards|arXiv (Cornell University)|Aug 13, 2018

Psychological and Educational Research Studies|References: 41|Cited by: 364

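One-Sentence Summary

This paper presents a large-scale empirical study, across 54 environments, of learning driven purely by intrinsic curiosity with no extrinsic rewards; it compares feature spaces for forward dynamics and highlights both the strengths and the limitations of prediction-error-based curiosity.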

ABSTRACT

Reinforcement learning algorithms rely on carefully engineering environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the need for developing reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as reward signal. In this paper: (a) We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance, and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. (b) We investigate the effect of using different feature spaces for computing prediction error and show that random features are sufficient for many popular RL game benchmarks, but learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.). (c) We demonstrate limitations of the prediction-based rewards in stochastic setups. Game-play videos and code are at https://pathak22.github.io/large-scale-curiosity/

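Motivation and Objectives

Position intrinsic curiosity as a scalable alternative to hand-designed extrinsic rewards in reinforcement learning.
Systematically study curiosity-driven learning across 54 environments, including Atari, Super Mario Bros., and 3D navigation.
Evaluate how different feature spaces for forward dynamics affect curiosity-based exploration.
Assess the scalability, stability, and generalization of curiosity-driven agents in the absence of extrinsic rewards.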

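Proposed Method

Use a dynamics-based intrinsic reward, defined as the surprisal r_t = -log p(phi(x_{t+1}) | x_t, a_t) derived from a forward dynamics model.
Compare feature spaces for the observation embedding phi: raw pixels, random features (RF), inverse dynamics features (IDF), and a variational autoencoder (VAE).
Stabilize training with reward and advantage normalization, observation normalization, many parallel actors, and batch normalization of features.
Remove the end-of-episode signal to study infinite-horizon, purely curiosity-driven exploration.
Evaluate on 54 environments (Atari, Super Mario Bros., Roboschool, Unity) and analyze generalization to novel levels.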
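The intrinsic reward above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it assumes a unit-variance Gaussian forward model, so the surprisal reduces to half the squared prediction error plus a constant, and it uses a fixed random linear projection as a stand-in for the random-feature embedding phi. The names `make_random_features`, `intrinsic_reward`, and `LinearForwardModel`, as well as the normalized gradient step, are hypothetical choices made for this example.

```python
import random


def make_random_features(obs_dim, feat_dim, seed=0):
    """Build a fixed, untrained random linear embedding phi (hypothetical helper)."""
    rng = random.Random(seed)
    scale = obs_dim ** -0.5
    weights = [[rng.gauss(0.0, 1.0) * scale for _ in range(obs_dim)]
               for _ in range(feat_dim)]

    def phi(obs):
        return [sum(w * o for w, o in zip(row, obs)) for row in weights]

    return phi


def intrinsic_reward(predicted_feat, next_feat):
    """Surprisal of a unit-variance Gaussian, dropping the additive constant."""
    return 0.5 * sum((p - q) ** 2 for p, q in zip(predicted_feat, next_feat))


class LinearForwardModel:
    """Toy forward model: predicts phi(x_{t+1}) from [phi(x_t), one_hot(a_t)]."""

    def __init__(self, feat_dim, n_actions, step_size=0.5):
        self.n_actions = n_actions
        self.step_size = step_size
        self.weights = [[0.0] * (feat_dim + n_actions) for _ in range(feat_dim)]

    def _inputs(self, feat, action):
        one_hot = [1.0 if a == action else 0.0 for a in range(self.n_actions)]
        return list(feat) + one_hot

    def predict(self, feat, action):
        x = self._inputs(feat, action)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def update(self, feat, action, next_feat):
        """One normalized gradient step on the squared prediction error."""
        x = self._inputs(feat, action)
        pred = self.predict(feat, action)
        norm = sum(xi * xi for xi in x)  # >= 1 because of the one-hot action
        for row, p, target in zip(self.weights, pred, next_feat):
            gain = self.step_size * (p - target) / norm
            for j, xi in enumerate(x):
                row[j] -= gain * xi


# Replay one transition (assumed toy data) and watch its reward shrink.
phi = make_random_features(obs_dim=8, feat_dim=4)
model = LinearForwardModel(feat_dim=4, n_actions=3)
obs = [0.1 * i for i in range(8)]
next_obs = [0.1 * (i + 1) for i in range(8)]

rewards = []
for _ in range(20):
    feat, next_feat = phi(obs), phi(next_obs)
    rewards.append(intrinsic_reward(model.predict(feat, 1), next_feat))
    model.update(feat, 1, next_feat)

print(rewards[0] > rewards[-1])  # the repeated transition becomes less surprising
```

Because the forward model is trained on the same transitions it is rewarded for mispredicting, familiar transitions stop paying out and the agent is pushed toward states it cannot yet predict. The embedding phi is never trained here, which mirrors the random-feature setting only in spirit.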

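Experimental Results

Research Questions

RQ1: Can purely curiosity-driven agents learn meaningful behavior across diverse environments without extrinsic rewards?
RQ2: How do different observation embeddings (RF, VAE, IDF, pixels) affect curiosity-driven exploration and generalization?
RQ3: Does curiosity-based exploration align with extrinsic rewards in human-designed environments, and what are its limitations in stochastic settings?
RQ4: To what extent do skills learned through curiosity transfer to new levels or environments without additional rewards?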

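Key Findings

Curiosity-driven agents collect extrinsic reward in many Atari games even though no extrinsic reward is used during training.
Random features provide a simple and stable curiosity embedding on many benchmarks; learned features generalize better to novel levels (e.g., in Super Mario Bros.).
Inverse dynamics features outperform random features in about 55% of Atari games, while raw pixels perform poorly as a feature space for forward dynamics.
In Super Mario Bros., increasing the batch size from 128 to 2048 parallel environment threads substantially improves exploration and level discovery.
On tasks with sparse or terminal rewards, curiosity can help, improving performance where training on extrinsic reward alone makes no progress.
Stochasticity (the noisy-TV problem) can mislead curiosity and slow learning, although in some cases it does not permanently prevent the agent from eventually obtaining extrinsic reward.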
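The noisy-TV finding can be illustrated with a toy calculation. This is a sketch under stated assumptions, not an experiment from the paper: the next feature is a single scalar, the predictor is a running mean (the best constant predictor under squared error), and `average_late_reward` is a hypothetical helper. It compares the prediction-error reward late in training for a deterministic transition and for a coin-flip transition.

```python
import random


def average_late_reward(sample_next, steps=2000, tail=500, seed=0):
    """Mean prediction-error reward over the last `tail` steps of training."""
    rng = random.Random(seed)
    prediction, count = 0.0, 0
    late_rewards = []
    for t in range(steps):
        target = sample_next(rng)
        reward = 0.5 * (prediction - target) ** 2  # same form as the surprisal reward
        if t >= steps - tail:
            late_rewards.append(reward)
        count += 1
        prediction += (target - prediction) / count  # running-mean update
    return sum(late_rewards) / len(late_rewards)


# Deterministic transition: the next feature is always 1.0.
deterministic = average_late_reward(lambda rng: 1.0)
# "Noisy TV": the next feature is an unpredictable coin flip.
noisy_tv = average_late_reward(lambda rng: rng.choice([0.0, 1.0]))

print(deterministic < noisy_tv)
```

For the deterministic transition the reward vanishes once the predictor has seen it, whereas for the coin flip the reward stays bounded away from zero no matter how long training runs. An agent maximizing this reward therefore has a standing incentive to keep watching the source of noise.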

This review was generated by AI and checked by a human editor.