QUICK REVIEW

[Paper Review] Latent Space Policies for Hierarchical Reinforcement Learning

Tuomas Haarnoja, Kristian Hartikainen|arXiv (Cornell University)|Apr 9, 2018

Reinforcement Learning in Robotics | References: 34 | Cited by: 73

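One-Sentence Summary

This paper proposes invertible, latent-variable policy layers for hierarchical deep reinforcement learning, trained with a maximum entropy objective so that higher layers control lower layers through the latent space, and reports improved performance on continuous control tasks.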

ABSTRACT

We address the problem of learning hierarchical deep neural network policies for reinforcement learning. In contrast to methods that explicitly restrict or cripple lower layers of a hierarchy to force them to use higher-level modulating signals, each layer in our framework is trained to directly solve the task, but acquires a range of diverse strategies via a maximum entropy reinforcement learning objective. Each layer is also augmented with latent random variables, which are sampled from a prior distribution during the training of that layer. The maximum entropy objective causes these latent variables to be incorporated into the layer's policy, and the higher level layer can directly control the behavior of the lower layer through this latent space. Furthermore, by constraining the mapping from latent variables to actions to be invertible, higher layers retain full expressivity: neither the higher layers nor the lower layers are constrained in their behavior. Our experimental evaluation demonstrates that we can improve on the performance of single-layer policies on standard benchmark tasks simply by adding additional layers, and that our method can solve more complex sparse-reward tasks by learning higher-level policies on top of high-entropy skills optimized for simple low-level objectives.
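For reference, the maximum entropy objective mentioned in the abstract is commonly written as shown below. This is the standard form from the maximum entropy RL literature, not a formula quoted from this review; the temperature symbol \alpha and the state-action marginal \rho_\pi are notational assumptions (some formulations absorb \alpha into the reward scale).

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\,\cdot \mid s_t) \big) \right]
```

The entropy term \mathcal{H} rewards the policy for staying stochastic, which is what pushes each layer to cover a range of strategies rather than a single one.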

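Motivation and Objectives

Motivate hierarchical reinforcement learning that does not cripple the lower layers: each layer can solve the task directly while still offering a range of diverse strategies.
Develop a latent-variable policy framework in which higher layers influence lower layers through an invertible mapping.
Achieve stable, scalable training using maximum entropy reinforcement learning and a normalizing-flow-based transformation from latent variables to actions.
Show that adding layers improves performance on standard benchmarks and makes sparse-reward tasks solvable.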

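Proposed Method

Formulate reinforcement learning as maximum entropy inference and build hierarchical policies by introducing latent variables.
Map latent variables to actions with an invertible neural network transformation (real-valued non-volume preserving transformation, real NVP) conditioned on the state.
Train the layers bottom-up: each layer learns a policy with its own latent variables, and its latent space serves as the action space of the layer above.
Embed each learned transformation into the environment to redefine the dynamics, so that subsequent layers operate on higher-level actions.
Optionally give lower layers shaped rewards to make the higher-level objective easier to learn, while keeping entropy-based exploration.
Implement training with soft actor-critic (SAC) for robust, sample-efficient learning.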
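The structure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random rather than trained with SAC, and every name here (Coupling, PolicyLayer, ToyEnv, LatentActionEnv) is hypothetical. It shows a state-conditioned affine coupling layer in the real NVP style, a two-level stack in which the higher layer's output is the lower layer's latent, and a wrapper that embeds a layer in the environment so the layer above acts in latent space.

```python
import numpy as np


class Coupling:
    """State-conditioned affine coupling layer (real NVP style), invertible in h."""

    def __init__(self, dim, state_dim, flip, rng, hidden=16):
        self.d, self.flip = dim // 2, flip  # first d entries pass through unchanged
        n_out = dim - self.d
        self.W1 = rng.normal(0.0, 0.5, (hidden, self.d + state_dim))
        self.W2 = rng.normal(0.0, 0.5, (2 * n_out, hidden))

    def _scale_shift(self, passed, s):
        out = self.W2 @ np.tanh(self.W1 @ np.concatenate([passed, s]))
        log_scale, shift = np.split(out, 2)
        return np.tanh(log_scale), shift  # bounded log-scale for stability

    def forward(self, h, s):
        h = h[::-1] if self.flip else h
        h1, h2 = h[:self.d], h[self.d:]
        log_scale, shift = self._scale_shift(h1, s)
        a = np.concatenate([h1, h2 * np.exp(log_scale) + shift])
        return a[::-1] if self.flip else a

    def inverse(self, a, s):
        a = a[::-1] if self.flip else a
        a1, a2 = a[:self.d], a[self.d:]
        log_scale, shift = self._scale_shift(a1, s)
        h = np.concatenate([a1, (a2 - shift) * np.exp(-log_scale)])
        return h[::-1] if self.flip else h


class PolicyLayer:
    """One level of the hierarchy: a stack of couplings mapping latent -> action."""

    def __init__(self, dim, state_dim, seed):
        rng = np.random.default_rng(seed)
        self.flows = [Coupling(dim, state_dim, flip, rng) for flip in (False, True)]

    def forward(self, h, s):
        for f in self.flows:
            h = f.forward(h, s)
        return h

    def inverse(self, a, s):
        for f in reversed(self.flows):
            a = f.inverse(a, s)
        return a


class ToyEnv:
    """Hypothetical point-mass environment: the scaled action is added to the state."""

    def __init__(self, dim):
        self.state = np.zeros(dim)

    def step(self, action):
        self.state = self.state + 0.1 * action
        reward = -float(np.linalg.norm(self.state - 1.0))
        return self.state, reward


class LatentActionEnv:
    """Embeds a frozen layer in the environment, so the next layer up
    acts by choosing that layer's latent vector."""

    def __init__(self, env, layer):
        self.env, self.layer = env, layer

    @property
    def state(self):
        return self.env.state

    def step(self, latent):
        return self.env.step(self.layer.forward(latent, self.env.state))


dim = 2
low, high = PolicyLayer(dim, dim, seed=0), PolicyLayer(dim, dim, seed=1)
env = LatentActionEnv(LatentActionEnv(ToyEnv(dim), low), high)

rng = np.random.default_rng(42)
s = env.state.copy()
h1 = rng.standard_normal(dim)            # top-level latent sampled from the prior
a = low.forward(high.forward(h1, s), s)  # action the base environment receives
h1_recovered = high.inverse(low.inverse(a, s), s)
next_state, reward = env.step(h1)
print(np.allclose(h1, h1_recovered))
```

Because every coupling is invertible for a fixed state, any action the base environment accepts corresponds to exactly one top-level latent. In this sketch, that is the sense in which stacking layers does not restrict what the higher layer can make the lower layer do.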

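Experimental Results

Research Questions

RQ1: Do latent-variable, invertible policy layers improve learning efficiency and final performance on continuous control tasks?
RQ2: Does bottom-up, layer-by-layer training of hierarchical latent space policies yield better hierarchical reinforcement learning results than end-to-end training?
RQ3: In sparse-reward environments, how does providing shaped rewards to lower layers affect learning of the higher-level policy?
RQ4: To what extent can the high-level policy control low-level behavior through the latent space?
RQ5: Does the method scale to deeper hierarchies and high-dimensional control problems?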

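Key Findings

Latent space hierarchical policies reach state-of-the-art performance on several continuous control benchmarks, including high-dimensional tasks.
Two-layer policies trained bottom-up, layer by layer, outperform single-layer policies and compare favorably with deep policies trained end to end.
Adding layers yields substantial performance gains on challenging tasks such as Ant and Humanoid.
Shaped rewards for lower layers help solve sparse-reward tasks, and because the transformation is invertible, the lower layers remain fully controllable by the higher layers.
The method shows better sample efficiency and robust learning across a range of environments.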


This review was generated by AI and reviewed by a human editor.