QUICK REVIEW

[Paper Review] Reverse Curriculum Generation for Reinforcement Learning

Carlos Florensa, David Held|arXiv (Cornell University)|Jul 17, 2017

Reinforcement Learning in Robotics|References: 33|Cited by: 140

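One-Sentence Summary

This paper proposes a reinforcement learning framework that trains in reverse: it automatically generates a curriculum of start states, beginning at a given goal state and gradually expanding to harder start states, which makes efficient learning possible on sparse-reward, goal-oriented tasks without demonstrations or reward shaping.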

ABSTRACT

Many relevant tasks require an agent to reach a certain state, or to manipulate objects into a desired configuration. For example, we might want a robot to align and assemble a gear onto an axle or insert and turn a key in a lock. These goal-oriented tasks present a considerable challenge for reinforcement learning, since their natural reward function is sparse and prohibitive amounts of exploration are required to reach the goal and receive some learning signal. Past approaches tackle these problems by exploiting expert demonstrations or by manually designing a task-specific reward shaping function to guide the learning agent. Instead, we propose a method to learn these tasks without requiring any prior knowledge other than obtaining a single state in which the task is achieved. The robot is trained in reverse, gradually learning to reach the goal from a set of start states increasingly far from the goal. Our method automatically generates a curriculum of start states that adapts to the agent's performance, leading to efficient training on goal-oriented tasks. We demonstrate our approach on difficult simulated navigation and fine-grained manipulation problems, not solvable by state-of-the-art reinforcement learning methods.

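Research Motivation and Goals

Learn goal-oriented tasks with sparse rewards without relying on reward shaping or demonstrations.
Propose a curriculum in which the start-state distribution adapts to the agent's current performance.
Develop a method that automatically generates start states by applying local perturbations outward from the goal.
Demonstrate effectiveness on challenging robot navigation and manipulation tasks that prior reinforcement learning methods could not solve.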

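Proposed Method

Formalize a learning process in which the start-state distribution can change at every iteration, chosen to maximize learning speed.
Define "good start states" as those from which the current policy succeeds with intermediate probability, neither almost always nor almost never.
Generate nearby start states by running short Brownian-motion-style rollouts in action space from seed states.
Keep a replay buffer of previously good start states to stabilize learning and allow gradual expansion.
Iteratively train the policy with TRPO (or any on-policy method) on the adaptive start distribution.
Evaluate progress on the original start-state distribution to verify generalization.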
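The steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2D point environment, the `reach` scalar standing in for a trained policy, the thresholds 0.1 and 0.9, and every function name are hypothetical. A real system would replace the marked placeholder with an on-policy update such as TRPO and estimate success rates from training rollouts.

```python
import math
import random

GOAL = (0.0, 0.0)  # the single known goal state (toy assumption)

def step(state, action, bound=5.0):
    """Toy dynamics (hypothetical): move a 2D point, clipped to a square arena."""
    clip = lambda v: max(-bound, min(bound, v))
    return (clip(state[0] + action[0]), clip(state[1] + action[1]))

def sample_nearby(seeds, n_new, horizon=5, noise=0.3, rng=random):
    """Brownian-motion-style expansion: short rollouts of random actions from
    seed states, collecting every visited state as a candidate start."""
    starts = list(seeds)
    target = len(seeds) + n_new
    while len(starts) < target:
        state = rng.choice(starts)
        for _ in range(horizon):
            action = (rng.gauss(0.0, noise), rng.gauss(0.0, noise))
            state = step(state, action)
            starts.append(state)
    return rng.sample(starts, n_new)

def select_good_starts(starts, success_rates, r_min=0.1, r_max=0.9):
    """Keep starts with intermediate success: not too easy, not too hard."""
    return [s for s, r in zip(starts, success_rates) if r_min < r < r_max]

def toy_success_rate(state, reach):
    """Stand-in for success estimated from rollouts: 1 near the goal,
    falling linearly to 0 at 1.5 * reach."""
    dist = math.hypot(state[0] - GOAL[0], state[1] - GOAL[1])
    return max(0.0, min(1.0, 1.5 - dist / reach))

def reverse_curriculum(n_iters=20, n_new=40, n_old=20, seed=0):
    rng = random.Random(seed)
    reach = 0.5                      # toy "competence radius" of the policy
    starts, replay = [GOAL], [GOAL]  # the curriculum begins at the goal
    for _ in range(n_iters):
        batch = sample_nearby(starts, n_new, rng=rng)
        batch += rng.choices(replay, k=n_old)  # mix in replayed old starts
        rates = [toy_success_rate(s, reach) for s in batch]
        good = select_good_starts(batch, rates)
        # Placeholder for the policy update: competence grows with the
        # fraction of the batch that gave a useful learning signal.
        reach += 0.2 * len(good) / len(batch)
        if good:  # guard added for this sketch: keep old seeds if none qualify
            starts = good
            replay.extend(good)
    return reach, replay
```

Calling `reverse_curriculum()` returns the final toy competence radius and the accumulated replay buffer. The point of the sketch is the data flow: seeds are expanded by random-action rollouts, filtered by intermediate success, and fed back as the next iteration's seeds.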

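Experimental Results

Research Questions

RQ1: Does adapting the start-state distribution during training speed up learning on sparse-reward, goal-oriented tasks?
RQ2: Does concentrating training on "good start states" and expanding outward from the goal yield faster and more robust policies than uniform start-state sampling?
RQ3: Is generating nearby states through Brownian motion in action space an effective way to expand the start-state curriculum?
RQ4: Can the curriculum be built without demonstrations or reward shaping and still solve challenging manipulation tasks?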

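Key Findings

The adaptive start-state curriculum improves both learning speed and final performance compared with uniform start-state sampling.
The method solves tasks that state-of-the-art RL methods of the time could not, including navigation and fine-grained manipulation.
Concentrating training on good start states near the goal and expanding outward achieves a form of reverse learning without requiring a model.
Generating nearby start states with Brownian motion grows the curriculum more efficiently than using all previous start states.
A simple ablation that uses all previous start states, without focusing on good ones, performs worse than the proposed method.
An oracle rejection-sampling upper bound indicates that, given its approximations, the method comes close to what is practically achievable.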
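To make the two comparison points in these findings concrete, the sketch below contrasts uniform start-state sampling with an oracle that rejection-samples good starts. It assumes the oracle draws candidates uniformly over the state space and keeps only those whose success rate falls in an intermediate band; that reading, the toy 2D arena, the `toy_success_rate` stand-in, and the thresholds are all illustrative assumptions rather than details confirmed by the text above.

```python
import math
import random

def toy_success_rate(state, reach=0.5):
    """Hypothetical stand-in for evaluating the policy from `state`:
    1 near the goal at the origin, falling linearly to 0 at 1.5 * reach."""
    dist = math.hypot(state[0], state[1])
    return max(0.0, min(1.0, 1.5 - dist / reach))

def uniform_starts(n, bound=5.0, rng=random):
    """Baseline: draw start states uniformly over the whole arena."""
    return [(rng.uniform(-bound, bound), rng.uniform(-bound, bound))
            for _ in range(n)]

def oracle_rejection_starts(n, success_rate, r_min=0.1, r_max=0.9,
                            bound=5.0, rng=random, max_tries=100_000):
    """Oracle: draw uniform candidates and keep only those with intermediate
    success. Returns the kept starts and the number of evaluations spent."""
    kept, tries = [], 0
    while len(kept) < n and tries < max_tries:
        candidate = uniform_starts(1, bound, rng)[0]
        tries += 1
        if r_min < success_rate(candidate) < r_max:
            kept.append(candidate)
    return kept, tries
```

With these toy numbers the accepting ring (distances between 0.3 and 0.7 from the goal) covers roughly 1% of the arena, so the oracle spends many policy evaluations per accepted start. That cost is one reason to treat it as a reference point rather than a practical method in this sketch.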


This review was generated by AI and reviewed by a human editor.