QUICK REVIEW

[Paper Review] Automatic Goal Generation for Reinforcement Learning Agents

Carlos Florensa, David Held|arXiv (Cornell University)|May 17, 2017

Reinforcement Learning in Robotics | References: 47 | Cited by: 144

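ONE-SENTENCE SUMMARY

This paper proposes Goal GAN, an adversarial framework that automatically generates goals of intermediate difficulty in order to train a single policy to reach a diverse, continuous set of goals under sparse rewards, yielding an automatic curriculum and improved sample efficiency.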

ABSTRACT

Reinforcement learning is a powerful technique to train an agent to perform a task. However, an agent that is trained using reinforcement learning is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing. We use a generator network to propose tasks for the agent to try to achieve, specified as goal states. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment. Our method can also learn to achieve tasks with sparse rewards, which traditionally pose significant challenges.

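MOTIVATION AND OBJECTIVES

Learn a single policy that can reach a diverse, continuous set of goals rather than solve one task.
Generate a curriculum automatically so that it matches the agent's current ability.
Develop a goal-conditioned reinforcement learning framework that works under sparse rewards without hand-designed rewards.
Demonstrate improved sample efficiency and scalability to higher-dimensional goal spaces.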

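PROPOSED METHOD

Define each goal as a parameterized subset of the state space, with a binary reward for reaching it.
Introduce Goal GAN, a generator trained to produce goals that lie in the current policy's set of Goals of Intermediate Difficulty (GOID).
Label each goal by the agent's observed success on it, and use the resulting positive and negative examples to train the GAN.
Alternate between training the policy on sampled GOID goals and updating the GAN according to the policy's performance.
Use TRPO with GAE as the underlying reinforcement learning optimizer for policy updates.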
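The reward and labeling steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `goal_reward` and `label_goid`, the Euclidean distance with tolerance `eps`, and the threshold values `r_min` and `r_max` are all hypothetical choices made for the example.

```python
import math

def goal_reward(position, goal, eps=0.3):
    """Binary reward: 1.0 if the position is within eps of the goal, else 0.0.

    The Euclidean metric and the value of eps are assumptions for this sketch.
    """
    return 1.0 if math.dist(position, goal) <= eps else 0.0

def label_goid(success_rates, r_min=0.1, r_max=0.9):
    """Label each goal 1 if its empirical success rate lies in [r_min, r_max].

    Goals labeled 1 are treated as "intermediate difficulty" (positive
    examples); goals labeled 0 are too hard or too easy (negative examples).
    The thresholds are illustrative defaults, not values from this review.
    """
    return [1 if r_min <= rate <= r_max else 0 for rate in success_rates]

if __name__ == "__main__":
    # Success rates would come from rolling out the current policy per goal.
    rates = [0.0, 0.5, 1.0]
    print(label_goid(rates))
```

In this sketch a goal that is never reached (rate 0.0) or always reached (rate 1.0) gets label 0, so only the goal with rate 0.5 would serve as a positive example for the generator.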

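EXPERIMENTAL RESULTS

Research Questions

RQ1: Does the automatic curriculum generated by Goal GAN improve sample efficiency over baselines when learning to reach many goals?
RQ2: Can Goal GAN adaptively sample goals of intermediate difficulty and track a multimodal goal distribution?
RQ3: How does the method scale to higher-dimensional or more complex goal spaces while maintaining performance?
RQ4: Is the method robust in sparse-reward settings without hand-designed rewards?
RQ5: Can the method avoid forgetting while the set of reachable goals expands over time?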

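Key Findings

Goal GAN accelerates learning by focusing on goals of intermediate difficulty, outperforming uniform sampling and several other baselines.
The generator dynamically shifts toward the GOID, producing goals that are neither too easy nor too hard as the policy improves.
The method tracks multimodal goal distributions and maintains diverse goal coverage, including in maze-like environments.
In high-dimensional goal spaces, the method stays effective by generating goals inside the feasible subset, avoiding uninformative samples.
A rejection-sampling oracle variant confirms that GOID-based sampling is close to optimal, while the full GAN-based method remains far more sample-efficient than the other methods.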
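To make the rejection-sampling oracle concrete, here is a hedged sketch of how such a sampler might look. It draws goals uniformly from a box, estimates each goal's success rate, and keeps only those inside the intermediate-difficulty band. The function name `oracle_goid_sample`, the `estimate_success` callback, the box bounds, and the thresholds are all assumptions for illustration; the paper's actual oracle may differ in its details.

```python
import random

def oracle_goid_sample(estimate_success, low, high, n_goals,
                       r_min=0.1, r_max=0.9, max_tries=10_000, seed=0):
    """Rejection-sample goals whose estimated success rate is in [r_min, r_max].

    estimate_success: hypothetical callable mapping a goal tuple to a success
        rate in [0, 1], e.g. the fraction of successful policy rollouts.
    low, high: per-dimension bounds of the box that goals are drawn from.
    Returns at most n_goals goals; fewer if max_tries is exhausted first.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_goals:
            break
        # Propose a goal uniformly at random inside the box.
        goal = tuple(rng.uniform(lo, hi) for lo, hi in zip(low, high))
        # Accept only goals of intermediate difficulty for the current policy.
        if r_min <= estimate_success(goal) <= r_max:
            kept.append(goal)
    return kept
```

Every accepted goal costs at least one success-rate estimate, and many proposals are discarded, which is one way to see why an oracle of this kind is expensive in environment interaction compared with a learned generator.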

This review was generated by AI and reviewed by human editors.