QUICK REVIEW

[Paper Review] Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning

Abhishek Gupta, Vikash Kumar|arXiv (Cornell University)|Oct 25, 2019

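Reinforcement Learning in Robotics | Cited by 116

One-Sentence Summary

Relay Policy Learning (RPL) combines imitation learning from unstructured demonstrations with hierarchical reinforcement learning to solve long-horizon robotic tasks: policies are first learned by imitation and then fine-tuned with reinforcement learning. It uses relay data relabeling to train a bi-level, goal-conditioned policy and improves on baseline methods.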

ABSTRACT

We present relay policy learning, a method for imitation and reinforcement learning that can solve multi-stage, long-horizon robotic tasks. This general and universally-applicable, two-phase approach consists of an imitation learning stage that produces goal-conditioned hierarchical policies, and a reinforcement learning phase that finetunes these policies for task performance. Our method, while not necessarily perfect at imitation learning, is very amenable to further improvement via environment interaction, allowing it to scale to challenging long-horizon tasks. We simplify the long-horizon policy learning problem by using a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, where the low-level only acts for a fixed number of steps, regardless of the goal achieved. While we rely on demonstration data to bootstrap policy learning, we do not assume access to demonstrations of every specific task that is being solved, and instead leverage unstructured and unsegmented demonstrations of semantically meaningful behaviors that are not only less burdensome to provide, but also can greatly facilitate further improvement using reinforcement learning. We demonstrate the effectiveness of our method on a number of multi-stage, long-horizon manipulation tasks in a challenging kitchen simulation environment. Videos are available at https://relay-policy-learning.github.io/

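Motivation and Objectives

Advance the solution of multi-stage, long-horizon robotic tasks while minimizing manual task annotation.
Bootstrap hierarchical policies from unstructured demonstrations so that they can later be fine-tuned with reinforcement learning.
Introduce relay data relabeling to create goal-conditioned datasets for the high-level and low-level policies.
Enable reinforcement learning fine-tuning that keeps a simple goal-conditioned reward structure and improves sample efficiency.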

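Proposed Method

Propose a bi-level hierarchical policy consisting of a high-level goal setter and a low-level subgoal-conditioned policy.
Use a fixed high-level planning interval (H): the high level sets a new subgoal every H steps, while the low level acts at every step.
Introduce relay data relabeling to generate goal-conditioned datasets for both levels from unstructured demonstrations (Algorithms 2 and 3).
Train the high-level and low-level policies by supervised imitation learning on the relabeled data to initialize them (relay imitation learning, RIL).
Fine-tune the policies with goal-conditioned natural policy gradient (NPG), while incorporating the demonstrations through a maximum-likelihood term so that the relabeled data is still used (relay reinforcement fine-tuning, RRF).
Distill the multiple fine-tuned behaviors into a single multi-task policy for generalization.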
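
The relabeling and fixed-interval control steps listed above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not a reproduction of the paper's Algorithms 2 and 3: the function names, the window parameters `low_window` and `high_window`, the trajectory layout (T+1 states, T actions), and the choice of the state `min(w, low_window)` steps ahead as the high-level subgoal label are all illustrative assumptions. The `step`, `high_policy`, and `low_policy` callables are hypothetical stand-ins for an environment and trained policies.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def relabel_low_level(
    states: Sequence[State], actions: Sequence[Action], low_window: int
) -> List[Tuple[State, Action, State]]:
    """Build (state, action, goal) tuples for the low-level policy.

    Assumption: every state reached within `low_window` steps after time t
    is treated as a goal that action a_t was heading toward.
    """
    data = []
    for t in range(len(actions)):
        for w in range(1, low_window + 1):
            if t + w >= len(states):
                break  # goal index would run past the end of the trajectory
            data.append((states[t], actions[t], states[t + w]))
    return data


def relabel_high_level(
    states: Sequence[State], low_window: int, high_window: int
) -> List[Tuple[State, State, State]]:
    """Build (state, subgoal, goal) tuples for the high-level policy.

    Assumption: the subgoal label is the state min(w, low_window) steps
    ahead, so it always stays within reach of the low-level policy.
    """
    data = []
    for t in range(len(states)):
        for w in range(1, high_window + 1):
            if t + w >= len(states):
                break
            subgoal = states[t + min(w, low_window)]
            data.append((states[t], subgoal, states[t + w]))
    return data


def relay_rollout(
    step: Callable[[State, Action], State],
    high_policy: Callable[[State, State], State],
    low_policy: Callable[[State, State], Action],
    state: State,
    goal: State,
    horizon_h: int,
    max_steps: int,
) -> State:
    """Run the bi-level policy: a new subgoal every H steps, an action every step."""
    subgoal = goal
    for t in range(max_steps):
        if t % horizon_h == 0:
            # The high level is queried only at fixed intervals, whether or
            # not the previous subgoal was reached.
            subgoal = high_policy(state, goal)
        state = step(state, low_policy(state, subgoal))
    return state


if __name__ == "__main__":
    # Toy 1-D demonstration: four states, three unit-step actions.
    demo_states = [(0.0,), (1.0,), (2.0,), (3.0,)]
    demo_actions = [(1.0,), (1.0,), (1.0,)]
    print(len(relabel_low_level(demo_states, demo_actions, low_window=2)))
    print(len(relabel_high_level(demo_states, low_window=2, high_window=3)))
```

Because every state inside the window becomes a goal label, one unlabeled demonstration yields many goal-conditioned training tuples for each level, which is why no task segmentation is needed in this sketch.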

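Experimental Results

Research Questions

RQ1: Can unstructured, unsegmented demonstrations effectively bootstrap hierarchical policies through imitation learning?
RQ2: Are policies learned by relay imitation learning more amenable to reinforcement learning fine-tuning than flat policies or policies learned from scratch?
RQ3: Can relay policy learning solve complex long-horizon manipulation tasks in a kitchen-like environment?
RQ4: After multiple fine-tuned tasks are distilled into a single multi-task policy, is performance maintained across different goals?

Key Findings

RIL outperforms flat goal-conditioned imitation at imitation learning, even though the demonstrations are unlabeled.
RL fine-tuning of relay policies significantly outperforms the baselines, and incorporating demonstration data during fine-tuning (RRF) yields a significant further gain.
The single multi-task policy obtained through distillation can solve a variety of compound goals.
Window size and reward design are critical to performance: larger windows degrade both imitation and fine-tuning, and sparse rewards work best when exploration is well directed.
RPL outperforms hierarchical reinforcement learning trained from scratch and pure imitation learning baselines on long-horizon kitchen tasks.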

This review was generated by AI and reviewed by human editors.