QUICK REVIEW

[Paper Review] Latent Space Policies for Hierarchical Reinforcement Learning

Tuomas Haarnoja, Kristian Hartikainen | arXiv (Cornell University) | Apr 9, 2018

Reinforcement Learning in Robotics | 34 references | 73 citations

TL;DR

The paper introduces latent-variable, invertible policy layers for hierarchical deep RL trained with maximum entropy objectives, enabling higher layers to control lower layers via latent spaces and achieving improved performance on continuous control tasks.

ABSTRACT

We address the problem of learning hierarchical deep neural network policies for reinforcement learning. In contrast to methods that explicitly restrict or cripple lower layers of a hierarchy to force them to use higher-level modulating signals, each layer in our framework is trained to directly solve the task, but acquires a range of diverse strategies via a maximum entropy reinforcement learning objective. Each layer is also augmented with latent random variables, which are sampled from a prior distribution during the training of that layer. The maximum entropy objective causes these latent variables to be incorporated into the layer's policy, and the higher level layer can directly control the behavior of the lower layer through this latent space. Furthermore, by constraining the mapping from latent variables to actions to be invertible, higher layers retain full expressivity: neither the higher layers nor the lower layers are constrained in their behavior. Our experimental evaluation demonstrates that we can improve on the performance of single-layer policies on standard benchmark tasks simply by adding additional layers, and that our method can solve more complex sparse-reward tasks by learning higher-level policies on top of high-entropy skills optimized for simple low-level objectives.

Motivation & Objective

Build hierarchical RL policies without crippling the lower layers: each layer is trained to solve the task directly while still acquiring a diverse range of strategies.
Develop a latent-variable policy framework where higher layers influence lower layers through invertible mappings.
Achieve stable, scalable training using maximum entropy RL and normalizing-flow-based latent-to-action transformations.
Demonstrate that adding layers improves performance on standard benchmarks and enables solving sparse-reward tasks.
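
For reference, the two ingredients named above can be written down compactly. This is a sketch in assumed notation rather than the paper's own: the temperature \alpha, the latent variable h, its prior p(h), and the state-conditioned map f are symbols chosen here for illustration. The first line is the standard maximum entropy RL objective; the second is the change-of-variables identity that gives the action density when the latent-to-action map is invertible.

```latex
% Maximum entropy RL objective (alpha weights the entropy bonus; assumed notation)
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Policy induced by an invertible, state-conditioned map a = f(h; s), with h ~ p(h)
\pi(a \mid s) = p\big(f^{-1}(a; s)\big)
                \left| \det \frac{\partial f^{-1}(a; s)}{\partial a} \right|
```

Because f is invertible in h for every state, each action a corresponds to exactly one latent h = f^{-1}(a; s), which is why a higher layer that chooses h can still reach any action.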

Proposed method

Cast RL as probabilistic inference with a maximum entropy objective, and augment each policy layer with latent variables to build hierarchical policies.
Use invertible neural network transforms (real-valued non-volume preserving transforms, i.e. real NVP) to map latent variables to actions, conditioned on state.
Train layers bottom-up, each layer learning a policy with its latent variable while providing the latent space as the action space for the layer above.
Embed each learned transformation into the environment to redefine dynamics and enable subsequent layers to operate on higher-level actions.
Optionally employ shaping rewards for lower layers to simplify learning of higher-level objectives, while preserving entropy-based exploration.
Implement with Soft Actor-Critic (SAC) for robust, sample-efficient training.
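
The steps above can be illustrated with a small sketch. It is not the authors' implementation: the class names (`CouplingLayer`, `LatentPolicyLayer`, `ToyEnv`, `LatentActionEnv`) are hypothetical, the "networks" are random linear maps instead of trained MLPs, and no SAC training is shown. It only demonstrates two mechanics: a state-conditioned affine coupling transform (in the style of real NVP) is exactly invertible, and a lower layer can be folded into the environment so that the next layer acts in its latent space.

```python
import numpy as np


class CouplingLayer:
    """State-conditioned affine coupling: maps latent h to output, invertibly."""

    def __init__(self, dim, state_dim, rng):
        self.split = dim // 2
        in_dim = self.split + state_dim
        out_dim = dim - self.split
        # Assumption: random linear maps stand in for trained scale/shift networks.
        self.W_s = 0.1 * rng.standard_normal((in_dim, out_dim))
        self.W_t = 0.1 * rng.standard_normal((in_dim, out_dim))

    def _scale_shift(self, first_half, state):
        x = np.concatenate([first_half, state])
        return np.tanh(x @ self.W_s), x @ self.W_t  # bounded log-scale, shift

    def forward(self, h, state):
        h1, h2 = h[:self.split], h[self.split:]
        log_s, t = self._scale_shift(h1, state)
        return np.concatenate([h1, h2 * np.exp(log_s) + t])

    def inverse(self, a, state):
        a1, a2 = a[:self.split], a[self.split:]
        log_s, t = self._scale_shift(a1, state)  # a1 == h1, so same scale/shift
        return np.concatenate([a1, (a2 - t) * np.exp(-log_s)])


class LatentPolicyLayer:
    """Two couplings with a reversal in between, so every dimension is transformed."""

    def __init__(self, dim, state_dim, rng):
        self.c1 = CouplingLayer(dim, state_dim, rng)
        self.c2 = CouplingLayer(dim, state_dim, rng)

    def forward(self, h, state):
        return self.c2.forward(self.c1.forward(h, state)[::-1], state)

    def inverse(self, a, state):
        return self.c1.inverse(self.c2.inverse(a, state)[::-1], state)


class ToyEnv:
    """Hypothetical stand-in environment: the state drifts with the action."""

    def __init__(self, dim):
        self.state = np.zeros(dim)

    def step(self, action):
        self.state = self.state + 0.1 * action
        return self.state, -float(np.sum(self.state ** 2))


class LatentActionEnv:
    """Embeds a lower layer in the environment: the next layer's action is the latent h."""

    def __init__(self, env, layer):
        self.env, self.layer = env, layer

    def step(self, latent):
        action = self.layer.forward(latent, self.env.state)
        return self.env.step(action)


rng = np.random.default_rng(0)
layer = LatentPolicyLayer(dim=4, state_dim=4, rng=rng)
state = rng.standard_normal(4)
h = rng.standard_normal(4)

a = layer.forward(h, state)           # latent -> action
h_rec = layer.inverse(a, state)       # action -> latent
assert np.allclose(h, h_rec)          # the mapping is invertible

env = LatentActionEnv(ToyEnv(4), layer)
next_state, reward = env.step(h)      # a higher layer would choose h here
```

Stacking works the same way at every level: wrapping `LatentActionEnv` again with a second `LatentPolicyLayer` would give a third layer a latent action space of its own. In this sketch the higher layer loses no expressivity, since for any desired action `layer.inverse` returns the latent that produces it.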

Experimental results

Research questions

RQ1: Can latent-variable, invertible policy layers improve learning efficiency and final performance in continuous control tasks?
RQ2: Does bottom-up, layerwise training of latent-space policies yield better hierarchical RL outcomes than end-to-end training?
RQ3: How does providing shaping rewards to lower layers affect learning of higher-level policies in sparse-reward settings?
RQ4: To what extent can the higher-level policies control lower-level behavior through the latent space?
RQ5: Is the approach scalable to deep hierarchies and high-dimensional control problems?

Key findings

Latent-space hierarchical policies achieve state-of-the-art performance in several continuous control benchmarks, including high-dimensional tasks.
Two-level policies trained in a bottom-up, layerwise fashion outperform single-level policies and compare favorably to end-to-end deeper policies.
Adding layers yields substantial performance gains on challenging tasks like Ant and Humanoid.
Lower-level shaping rewards can help solve sparse-reward tasks while remaining controllable by higher levels due to invertible transformations.
The method demonstrates improved sample efficiency and robust learning across multiple environments.

This review was created by AI and reviewed by human editors.