QUICK REVIEW

[Paper Review] Unsupervised Meta-Learning for Reinforcement Learning

Abhishek Gupta, Benjamin Eysenbach|arXiv (Cornell University)|Jun 12, 2018

Machine Learning and Data Classification | References: 63 | Cited by: 57

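One-Sentence Summary

This paper proposes unsupervised meta-reinforcement learning: tasks are generated automatically with a mutual-information objective, and MAML is then meta-trained on them to acquire an environment-specific fast learning procedure that adapts quickly to new rewards.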

ABSTRACT

Meta-learning algorithms use past experience to learn to quickly solve new tasks. In the context of reinforcement learning, meta-learning algorithms acquire reinforcement learning procedures to solve new problems more efficiently by utilizing experience from prior tasks. The performance of meta-learning algorithms depends on the tasks available for meta-training: in the same way that supervised learning generalizes best to test points drawn from the same distribution as the training points, meta-learning methods generalize best to tasks from the same distribution as the meta-training tasks. In effect, meta-reinforcement learning offloads the design burden from algorithm design to task design. If we can automate the process of task design as well, we can devise a meta-learning algorithm that is truly automated. In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. We motivate and describe a general recipe for unsupervised meta-reinforcement learning, and present an instantiation of this approach. Our conceptual and theoretical contributions consist of formulating the unsupervised meta-reinforcement learning problem and describing how task proposals based on mutual information can be used to train optimal meta-learners. Our experimental results indicate that unsupervised meta-reinforcement learning effectively acquires accelerated reinforcement learning procedures without the need for manual task design and these procedures exceed the performance of learning from scratch.

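Motivation and Goals

Reduce the human effort required by meta-reinforcement learning by eliminating hand-designed meta-training tasks.
Enable fast adaptation to new reward functions under fixed environment dynamics.
Show that task proposals based on mutual information can yield near-optimal meta-learners.
Demonstrate advantages over learning from scratch and over pure exploration followed by fine-tuning.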

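Proposed Method

Define a reward-free CMP (controlled Markov process) and formulate learning as the search for a fast-adapting procedure f.
Propose tasks through a parameterized reward r_z(s,a) induced by a latent variable z, optimized to minimize worst-case regret.
Realize practical unsupervised meta-learning by generating diverse tasks with a mutual-information objective (based on DIAYN) and meta-learning over them with MAML.
Train a discriminator D_phi to maximize I(z;s), and derive r_z(s,a) = log D_phi(z|s) for task generation.
Use DIAYN to obtain a latent-conditioned policy, then apply MAML to learn how to learn across the proposed tasks.
Discuss a random-task baseline (a random discriminator) as a control.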
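
The task-proposal step, r_z(s,a) = log D_phi(z|s), can be illustrated with a small sketch. This is not the authors' implementation: it assumes a discrete state space and swaps the neural discriminator D_phi for a Laplace-smoothed count table. The names CountDiscriminator, propose_task, and alpha, along with the "left"/"right" visitation data, are hypothetical and chosen only for illustration. The reward here depends on the state alone, matching the formula above.

```python
import math
from collections import defaultdict


class CountDiscriminator:
    """Tabular stand-in for D_phi(z|s) (assumption: discrete states, no neural net)."""

    def __init__(self, n_skills, alpha=1.0):
        self.n_skills = n_skills
        self.alpha = alpha  # Laplace smoothing pseudo-count (illustrative choice)
        self.counts = defaultdict(lambda: [0] * n_skills)

    def update(self, state, z):
        # Record that the policy conditioned on latent z visited `state`.
        self.counts[state][z] += 1

    def prob(self, z, state):
        # Smoothed estimate of p(z | state); uniform for unvisited states.
        row = self.counts[state]
        return (row[z] + self.alpha) / (sum(row) + self.alpha * self.n_skills)


def propose_task(disc, z):
    """Return the reward function r_z(s) = log D(z|s) for latent task z."""
    def reward(state):
        return math.log(disc.prob(z, state))
    return reward


disc = CountDiscriminator(n_skills=2)
# Hypothetical visitation data: skill 0 mostly stays left, skill 1 mostly right.
for s in ["left"] * 9 + ["right"] * 1:
    disc.update(s, 0)
for s in ["left"] * 1 + ["right"] * 9:
    disc.update(s, 1)

r0 = propose_task(disc, 0)
# Skill 0 earns more reward in states that identify it.
print(r0("left") > r0("right"))  # prints True
```

Each latent z yields one reward function, so sampling z gives a distribution of tasks. In the method summarized above, a meta-learner such as MAML would then be trained across these proposed tasks; that step is omitted from this sketch.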

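Experimental Results

Research Questions

RQ1: Can unsupervised task proposals remove the need to hand-design the meta-training task distribution in meta-reinforcement learning?
RQ2: Can task proposals based on mutual information produce environment-specific fast learning procedures that adapt to unseen reward functions?
RQ3: On benchmark control tasks, how does unsupervised meta-reinforcement learning compare with learning from scratch and with hand-designed meta-training distributions?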

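Main Findings

Compared with learning from scratch, unsupervised meta-reinforcement learning accelerates learning across multiple tasks and environments.
On complex tasks, DIAYN-based task proposals generally outperform random task proposals.
Unsupervised meta-learning can approach the performance of an oracle method that relies on a hand-designed task distribution.
When fine-tuning on new rewards, UML-DIAYN tends to outperform DIAYN initialization and VIME-based pretraining.
The results suggest that environment-specific priors learned through unsupervised interaction improve fast adaptation.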


This review was generated by AI and checked by a human editor.