QUICK REVIEW

[Paper Review] Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?

Ofir Nachum, Haoran Tang|arXiv (Cornell University)|Sep 23, 2019

Reinforcement Learning in Robotics | References: 40 | Cited by: 51

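One-Sentence Summary

The paper empirically analyzes hierarchical reinforcement learning (HRL) and finds that most of its benefits come from improved exploration rather than from easier policy learning or semantic action representations. It then proposes non-hierarchical exploration methods that perform comparably to HRL.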

ABSTRACT

Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action space, among others. However, in fully observed, Markovian settings, it is not immediately clear why hierarchical RL should provide benefits over standard "shallow" RL architectures. In this work, we isolate and evaluate the claimed benefits of hierarchical RL on a suite of tasks encompassing locomotion, navigation, and manipulation. Surprisingly, we find that most of the observed benefits of hierarchy can be attributed to improved exploration, as opposed to easier policy learning or imposed hierarchical structures. Given this insight, we present exploration techniques inspired by hierarchy that achieve performance competitive with hierarchical RL while at the same time being much simpler to use and implement.

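Research Motivation and Goals

Investigate why hierarchical reinforcement learning (HRL) helps on complex tasks.
Isolate and evaluate the claimed benefits of hierarchy on locomotion, navigation, and manipulation tasks.
Determine whether the improvements come from training over temporally extended actions, from exploring with them, or from semantic representations.
Assess whether non-hierarchical methods can match HRL's performance by using hierarchy-inspired exploration strategies.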

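Proposed Method

Empirically evaluate two HRL paradigms (the options framework and the goal-conditioned HIRO) on four locomotion/navigation/manipulation tasks (AntMaze, AntPush, AntBlock, AntBlockMaze).
Isolate the effect of temporal abstraction by decoupling the training horizon (c_train) from the exploration horizon (c_expl).
Compare HRL against non-hierarchical agents trained with multi-step rewards, and against shadow agents trained on data collected by HRL.
Propose and test two HRL-inspired exploration strategies (Explore & Exploit and Switching Ensemble) that use no explicit hierarchical structure.
Use ablation experiments to separate the contribution of exploration from that of training and representation in HRL performance.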
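To make the idea of temporally extended exploration without a hierarchy more concrete, here is a minimal Python sketch loosely modeled on the Switching Ensemble strategy named above. The summary does not give the algorithm's details, so everything here is an assumption for illustration: the class name `SwitchingEnsemble`, the toy `Policy` type, and the rule of re-sampling the acting policy uniformly at random every `c_expl` steps. Training of the individual policies is not shown.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical toy type: a policy maps an observation to an action.
Policy = Callable[[float], float]


class SwitchingEnsemble:
    """Acts with one ensemble member at a time, switching every c_expl steps.

    Illustrative sketch only; not the paper's implementation.
    """

    def __init__(self, policies: Sequence[Policy], c_expl: int, seed: int = 0):
        if not policies:
            raise ValueError("need at least one policy")
        if c_expl < 1:
            raise ValueError("c_expl must be a positive integer")
        self.policies: List[Policy] = list(policies)
        self.c_expl = c_expl          # exploration horizon (assumed meaning)
        self.rng = random.Random(seed)
        self.steps = 0                # environment steps taken so far
        self.active = 0               # index of the policy currently acting

    def act(self, obs: float) -> float:
        # At the start of each c_expl-step window, pick a new acting policy.
        if self.steps % self.c_expl == 0:
            self.active = self.rng.randrange(len(self.policies))
        self.steps += 1
        return self.policies[self.active](obs)


if __name__ == "__main__":
    # Three dummy policies that each return a constant action.
    dummy = [lambda obs, k=k: float(k) for k in range(3)]
    agent = SwitchingEnsemble(dummy, c_expl=5, seed=1)
    print([agent.act(0.0) for _ in range(20)])
```

Because the acting policy stays fixed within each window, behavior is committed for `c_expl` consecutive steps, which is the kind of temporally extended exploration a higher-level controller would otherwise provide.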

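Experimental Results

Research Questions

RQ1: In the tasks considered, is it temporally extended training or temporally extended exploration that explains HRL's empirical gains?
RQ2: Are the benefits of a high-level action representation (semantic training) essential to HRL's performance?
RQ3: Can non-hierarchical agents match HRL's performance when given HRL-like exploration or multi-step rewards?
RQ4: Can HRL-inspired exploration strategies lift non-hierarchical agents to HRL-level performance?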

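Key Findings

Most of HRL's benefits stem from improved exploration, not from easier training or semantic action representations.
Once exploration is controlled for, multi-step rewards reproduce a large share of HRL's training gains, which makes the high-level action representation less important.
Non-hierarchical agents equipped with temporally extended or goal-directed exploration match HRL's performance on several tasks.
The two non-hierarchical exploration methods (Explore & Exploit and Switching Ensemble) reach HRL-like performance, underscoring that exploration is the key factor.
An explicit hierarchical structure is not necessary for strong performance; in the environments tested, HRL-inspired exploration strategies are sufficient.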
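The multi-step rewards mentioned in the findings can be illustrated with the standard n-step return, where the rewards of several consecutive steps are discounted and summed before bootstrapping from a value estimate. The sketch below is a generic version of that computation, not code from the paper; the function name `multi_step_target` and the reading of the list length as the training horizon c_train are assumptions.

```python
from typing import Sequence


def multi_step_target(
    rewards: Sequence[float], bootstrap_value: float, gamma: float
) -> float:
    """Return r_0 + gamma*r_1 + ... + gamma^(n-1)*r_(n-1) + gamma^n * V.

    `rewards` holds the n rewards observed after the current state, and
    `bootstrap_value` is the value estimate V of the state reached after them.
    """
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


if __name__ == "__main__":
    # Three-step example with gamma = 0.5 and a bootstrap value of 10.
    print(multi_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

With a single reward this reduces to the usual one-step target, so a flat agent can be moved toward longer training horizons by changing only the number of rewards passed in.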

This review was generated by AI and reviewed by human editors.