QUICK REVIEW

[Paper Review] Data-Efficient Hierarchical Reinforcement Learning

Ofir Nachum, Shixiang Gu|arXiv (Cornell University)|May 21, 2018

Reinforcement Learning in Robotics | References: 3 | Cited by: 265

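One-Sentence Summary

TL;DR: The paper proposes HIRO, a two-level HRL agent trained off-policy with an off-policy correction, which achieves strong sample efficiency and performance on locomotion and object-interaction tasks.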

ABSTRACT

Hierarchical reinforcement learning (HRL) is a promising approach to extend traditional reinforcement learning (RL) methods to solve more complex tasks. Yet, the majority of current HRL methods require careful task-specific design and on-policy training, making them difficult to apply in real-world scenarios. In this paper, we study how we can develop HRL algorithms that are general, in that they do not make onerous additional assumptions beyond standard RL algorithms, and efficient, in the sense that they can be used with modest numbers of interaction samples, making them suitable for real-world problems such as robotic control. For generality, we develop a scheme where lower-level controllers are supervised with goals that are learned and proposed automatically by the higher-level controllers. To address efficiency, we propose to use off-policy experience for both higher and lower-level training. This poses a considerable challenge, since changes to the lower-level behaviors change the action space for the higher-level policy, and we introduce an off-policy correction to remedy this challenge. This allows us to take advantage of recent advances in off-policy model-free RL to learn both higher- and lower-level policies using substantially fewer environment interactions than on-policy algorithms. We term the resulting HRL agent HIRO and find that it is generally applicable and highly sample-efficient. Our experiments show that HIRO can be used to learn highly complex behaviors for simulated robots, such as pushing objects and utilizing them to reach target locations, learning from only a few million samples, equivalent to a few days of real-time interaction. In comparisons with a number of prior HRL methods, we find that our approach substantially outperforms previous state-of-the-art techniques.

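Motivation and Objectives

Motivate and develop a general, data-efficient HRL approach that works with standard RL components.
Learn lower-level policies guided by goals that the higher-level controller proposes automatically.
Enable off-policy training at both levels of the hierarchy to improve sample efficiency.
Introduce an off-policy correction to handle the non-stationarity caused by a changing lower-level policy.
Demonstrate strong performance on challenging simulated robot tasks with limited interaction data.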

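Proposed Method

A two-level hierarchy with a higher-level policy (which outputs goals) and a lower-level policy (which outputs actions).
The lower level receives a goal g_t and earns the intrinsic reward r = -||s_t + g_t - s_{t+1}||_2; the higher level acts every c steps, so it is optimized over temporally extended goals.
Higher-level experience is relabeled (the off-policy correction) with goals that maximize the probability of the past lower-level actions under the current lower-level controller, which makes off-policy learning possible.
Both policies are trained with an off-policy TD method (TD3) using replay buffers.
Goals are defined directly in the raw state observation space, which avoids learned embeddings and hand-designed goal spaces.
The higher-level relabeling procedure evaluates eight candidate goals together with the original goal and a goal based on the state difference, in order to approximately maximize the likelihood.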
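
The intrinsic reward and the relabeling step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `lower_policy`, `num_sampled`, and `sigma` are hypothetical. Goals are assumed to have the same dimensionality as states. The eight extra candidates are assumed to be drawn from a Gaussian centered on the state difference s_{t+c} - s_t. The squared action error is used as a stand-in for the negative log-likelihood, which is reasonable for a deterministic policy with fixed-variance Gaussian noise. The carry-over rule g_{t+1} = s_t + g_t - s_{t+1} is likewise an assumption, chosen so that the goal keeps pointing at the same absolute target state between higher-level decisions.

```python
import numpy as np


def intrinsic_reward(s_t, g_t, s_next):
    """Lower-level reward: negative L2 distance between s_t + g_t and s_{t+1}."""
    return -np.linalg.norm(s_t + g_t - s_next)


def goal_transition(s_t, g_t, s_next):
    """Carry the goal forward so that s_t + g_t == s_next + g_next (assumed rule)."""
    return s_t + g_t - s_next


def relabel_goal(states, actions, original_goal, lower_policy, rng,
                 num_sampled=8, sigma=1.0):
    """Pick the candidate goal under which the current lower-level policy
    best reproduces the logged lower-level actions.

    states:  array of shape (c + 1, state_dim), s_t ... s_{t+c}
    actions: array of shape (c, action_dim),    a_t ... a_{t+c-1}
    lower_policy(state, goal) -> action  (hypothetical deterministic policy)
    """
    diff = states[-1] - states[0]
    # Assumption: extra candidates are Gaussian samples centered on the difference.
    sampled = rng.normal(loc=diff, scale=sigma, size=(num_sampled, diff.size))
    candidates = np.vstack([original_goal, diff, sampled])

    best_goal, best_score = None, -np.inf
    for g0 in candidates:
        g, score = g0, 0.0
        for i in range(len(actions)):
            predicted = lower_policy(states[i], g)
            # Squared action error as a proxy for the negative log-likelihood.
            score -= 0.5 * np.sum((actions[i] - predicted) ** 2)
            g = goal_transition(states[i], g, states[i + 1])
        if score > best_score:
            best_goal, best_score = g0, score
    return best_goal
```

In a replay-based setup, a function like `relabel_goal` would be called on each sampled higher-level transition just before the higher-level update, so that stale goals are replaced by ones consistent with the current lower-level policy.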

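Experimental Results

Research Questions

RQ1: Can a two-level HRL system trained off-policy with an off-policy correction learn complex tasks efficiently?
RQ2: Does using raw state observations as goals for the lower-level policy improve learning speed and performance?
RQ3: How does the proposed off-policy correction affect stability and sample efficiency compared with naive off-policy HRL?
RQ4: How does HIRO perform relative to prior HRL methods on challenging locomotion and object-interaction tasks?

Key Findings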

Method	Ant Gather	Ant Maze	Ant Push	Ant Fall
HIRO	3.02 ± 1.49	0.99 ± 0.01	0.92 ± 0.04	0.66 ± 0.07
FuN representation	0.03 ± 0.01	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0
FuN transition PG	0.41 ± 0.06	0.0 ± 0.0	0.56 ± 0.39	0.01 ± 0.02
FuN cos similarity	0.85 ± 1.17	0.16 ± 0.33	0.06 ± 0.17	0.07 ± 0.22
FuN	0.01 ± 0.01	0.0 ± 0.0	0.0 ± 0.0	0.0 ± 0.0
SNN4HRL	1.92 ± 0.52	0.0 ± 0.0	0.02 ± 0.01	0.0 ± 0.0
VIME	1.42 ± 0.90	0.0 ± 0.0	0.02 ± 0.02	0.0 ± 0.0

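HIRO performs strongly on the Ant Gather, Ant Maze, Ant Push, and Ant Fall tasks.
After 10M steps, HIRO outperforms every baseline on all four tasks, including the FuN variants, SNN4HRL, and VIME; the closest result is on Ant Gather, where SNN4HRL (which pre-trains its lower level) reaches 1.92 ± 0.52 against HIRO's 3.02 ± 1.49.
HIRO learns quickly, solving complex tasks after a few million steps of environment interaction (equivalent to a few days of real-time interaction).
The off-policy correction is crucial for stability and performance on the harder tasks; naive off-policy learning without it degrades on Ant Push and Ant Fall.
Using raw state observations as goals provides an immediate intrinsic reward signal and generalizes straightforwardly across tasks.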

This review was generated by AI and reviewed by human editors.