QUICK REVIEW

[Paper Review] Learning Multi-Level Hierarchies with Hindsight

Andrew Levy, George Konidaris|arXiv (Cornell University)|Dec 4, 2017
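Reinforcement Learning in Robotics|Cited by 75
One-Sentence Summary

This paper introduces Hierarchical Actor-Critic (HAC), a hierarchical reinforcement learning framework that uses hindsight action and hindsight goal transitions to overcome non-stationarity and sparse rewards when training multiple policy levels in parallel, enabling efficient learning in continuous state and action spaces.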

ABSTRACT

Hierarchical agents have the potential to solve sequential decision making tasks with greater sample efficiency than their non-hierarchical counterparts because hierarchical agents can break down tasks into sets of subtasks that only require short sequences of decisions. In order to realize this potential of faster learning, hierarchical agents need to be able to learn their multiple levels of policies in parallel so these simpler subproblems can be solved simultaneously. Yet, learning multiple levels of policies in parallel is hard because it is inherently unstable: changes in a policy at one level of the hierarchy may cause changes in the transition and reward functions at higher levels in the hierarchy, making it difficult to jointly learn multiple levels of policies. In this paper, we introduce a new Hierarchical Reinforcement Learning (HRL) framework, Hierarchical Actor-Critic (HAC), that can overcome the instability issues that arise when agents try to jointly learn multiple levels of policies. The main idea behind HAC is to train each level of the hierarchy independently of the lower levels by training each level as if the lower level policies are already optimal. We demonstrate experimentally in both grid world and simulated robotics domains that our approach can significantly accelerate learning relative to other non-hierarchical and hierarchical methods. Indeed, our framework is the first to successfully learn 3-level hierarchies in parallel in tasks with continuous state and action spaces.

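Motivation and Objectives

  • Motivate the use of hierarchies to accelerate learning in sequential decision-making tasks.
  • Build a framework that learns multiple levels of policies in parallel despite non-stationary transitions.
  • Propose mechanisms (hindsight action/goal transitions and subgoal testing) that enable stable parallel learning under sparse rewards.
  • Demonstrate scalability to 2-level and 3-level hierarchies in grid-world and continuous robotics domains.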

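Proposed Method

  • Propose Hierarchical Actor-Critic (HAC), which converts a single UMDP (Universal MDP) into a set of nested UMDPs, one per level.
  • Use goal-conditioned policies: each level outputs subgoals for the level below, and the lowest level outputs primitive actions.
  • Adopt nested transition functions, in which the transition at an upper level depends on the full hierarchy of policies below it.
  • Introduce hindsight action transitions to simulate an optimal lower-level hierarchy, which stabilizes learning across levels.
  • Introduce hindsight goal transitions, which extend Hindsight Experience Replay to the hierarchical setting to cope with sparse rewards.
  • Add subgoal testing transitions to ensure that proposed subgoals are achievable by the current lower-level policies and to balance the learning signal.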
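
The three transition types listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Transition` record, the `reached` goal test, the discount of 0.98, the convention of setting the discount to 0 on success, and the choice to relabel with the final state of an episode are all assumptions made for illustration; the summary itself only names the three mechanisms.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    """Hypothetical replay-buffer record for one level of the hierarchy."""
    state: State
    action: State       # above the lowest level, an action is a subgoal state
    reward: float
    next_state: State
    goal: State
    discount: float


def reached(state: State, goal: State, tol: float = 1e-6) -> bool:
    """Assumed goal test: every coordinate lies within `tol` of the goal."""
    return all(abs(s - g) <= tol for s, g in zip(state, goal))


def sparse_reward(next_state: State, goal: State, gamma: float) -> Tuple[float, float]:
    """Assumed sparse scheme: (0, 0) on success, otherwise (-1, gamma)."""
    if reached(next_state, goal):
        return 0.0, 0.0
    return -1.0, gamma


def hindsight_action_transition(state: State, proposed_subgoal: State,
                                next_state: State, goal: State,
                                gamma: float = 0.98) -> Transition:
    """Store the state the lower level actually reached as the action.

    The upper level then sees data that looks as if the lower level had
    executed its "subgoal" perfectly, whatever the lower policy currently is.
    """
    if reached(next_state, proposed_subgoal):
        action = proposed_subgoal
    else:
        action = next_state
    reward, discount = sparse_reward(next_state, goal, gamma)
    return Transition(state, action, reward, next_state, goal, discount)


def hindsight_goal_transitions(episode: List[Transition],
                               gamma: float = 0.98) -> List[Transition]:
    """HER-style relabelling: treat the last state reached as the goal."""
    if not episode:
        return []
    new_goal = episode[-1].next_state
    relabelled = []
    for t in episode:
        reward, discount = sparse_reward(t.next_state, new_goal, gamma)
        relabelled.append(Transition(t.state, t.action, reward,
                                     t.next_state, new_goal, discount))
    return relabelled


def subgoal_testing_transition(state: State, proposed_subgoal: State,
                               next_state: State, goal: State,
                               horizon: int) -> Optional[Transition]:
    """Penalise a tested subgoal that the lower level missed.

    Returns None when the subgoal was reached, since no penalty is needed.
    """
    if reached(next_state, proposed_subgoal):
        return None
    return Transition(state, proposed_subgoal, -float(horizon),
                      next_state, goal, 0.0)
```

For instance, if a level proposes the subgoal `(3.0,)` from state `(0.0,)` but the lower level ends at `(1.0,)`, `hindsight_action_transition` stores `(1.0,)` as the action, while `subgoal_testing_transition` with `horizon=10` keeps the original subgoal and assigns a reward of -10. The first keeps the upper level's data independent of the current lower-level policy; the second discourages subgoals that the lower level cannot yet reach.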

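Experimental Results

Research Questions

  • RQ1: Can HAC learn multiple levels of policies in parallel in both discrete and continuous domains?
  • RQ2: Can HAC train 3-level hierarchies in parallel, and how do they compare with 2-level hierarchies and the flat baseline?
  • RQ3: Do hindsight action/goal transitions and subgoal testing transitions mitigate non-stationarity and improve learning efficiency?
  • RQ4: How does HAC perform relative to HIRO on continuous robotics tasks?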

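Key Findings

  • HAC significantly outperforms flat agents on both discrete and continuous tasks.
  • 3-level hierarchies learned in parallel outperform 2-level hierarchies, which in turn outperform flat learning.
  • HAC outperforms HIRO on the three simulated robotics tasks in the experiments.
  • Hindsight action and goal transitions, together with subgoal testing, enable stable parallel learning and mitigate the problems caused by non-stationary transitions.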


This review was generated by AI and reviewed by human editors.