QUICK REVIEW

[Paper Review] Variational Option Discovery Algorithms

Joshua Achiam, Harrison Edwards|arXiv (Cornell University)|Jul 26, 2018

Reinforcement Learning in Robotics|18 references|82 citations

TL;DR

The paper introduces VALOR, a variational option discovery method whose decoder recovers contexts from complete trajectories, and a curriculum strategy for scalably learning hundreds of diverse behaviors. It also compares VALOR with VIC and DIAYN and examines both the limitations of the approach and the applicability of learned options to downstream tasks.

ABSTRACT

We explore methods for option discovery based on variational inference and make two algorithmic contributions. First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent's performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.
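
For reference, the encoder/decoder description in the abstract can be written as an objective. This is a sketch in notation chosen here, not quoted from the review: G is the context (noise) distribution, \pi the policy, P_D the decoder's distribution over contexts, \mathcal{H}(\pi \mid c) an entropy regularizer on the policy, and \beta its weight. The second line is the standard \beta-VAE objective, shown for comparison.

```latex
% Variational option discovery: the policy "encodes" c into a trajectory tau,
% and the decoder recovers c from tau.
\max_{\pi, D} \; \mathbb{E}_{c \sim G} \Big[ \mathbb{E}_{\tau \sim \pi, c} \big[ \log P_D(c \mid \tau) \big] + \beta \, \mathcal{H}(\pi \mid c) \Big]

% beta-VAE, for comparison: encoder q_phi, decoder p_theta, prior p(z).
\max_{\phi, \theta} \; \mathbb{E}_{x} \Big[ \mathbb{E}_{z \sim q_\phi(\cdot \mid x)} \big[ \log p_\theta(x \mid z) \big] - \beta \, D_{\mathrm{KL}}\big( q_\phi(z \mid x) \,\|\, p(z) \big) \Big]
```

Read side by side, the context c plays the role of the data x, the trajectory \tau plays the role of the latent z, the policy acts as the encoder, and the entropy term stands in for the KL regularizer.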

Motivation & Objective

Investigate variational inference methods for discovering options (skills) without extrinsic rewards.
Establish a connection between variational option discovery and variational autoencoders.
Propose VALOR, an option discovery method whose decoder operates on whole trajectories.
Introduce curriculum learning to stabilize and accelerate learning across many contexts.
Assess the diversity, qualitative nature, and potential downstream utility of learned options.

Proposed method

Formulate option discovery as maximizing a variational objective where a context c is encoded into a trajectory via a policy and decoded from the trajectory.
Show that the objective aligns with a β-VAE-like bound, connecting VIC/DIAYN to a VAE template.
Propose VALOR, where the decoder observes full trajectories but not actions, using a bidirectional LSTM to decode contexts from trajectory deltas.
Implement a curriculum that gradually increases the number of contexts K as the decoder performance improves (threshold-based growth).
Compare VALOR, VIC, and DIAYN in locomotion environments (point mass, Half-Cheetah, Swimmer, Ant) with and without the curriculum; employ recurrent policies and policy gradient training.
Explore downstream task potential by integrating a pretrained VALOR policy as a lower level in a hierarchical Ant-Maze task.
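
The threshold-based curriculum in the list above can be sketched as a small update rule. This is a minimal illustration, not the authors' code: the function name `update_num_contexts`, the default threshold of 0.86, and the growth rule `int(1.5 * k + 1)` are assumptions made for this example, since the review only states that K grows when decoder performance is strong enough.

```python
def update_num_contexts(k, mean_decoder_prob, k_max, threshold=0.86, growth=1.5):
    """Return the number of contexts K to use in the next training epoch.

    k                 -- current number of contexts
    mean_decoder_prob -- average probability the decoder assigns to the true
                         context on recent trajectories (assumed metric)
    k_max             -- upper bound on the number of contexts
    threshold, growth -- illustrative values, assumed for this sketch
    """
    if mean_decoder_prob >= threshold:
        # Decoder is doing well on the current set: widen the context set.
        return min(int(growth * k + 1), k_max)
    # Otherwise keep training on the same set of contexts.
    return k


def curriculum_schedule(k_start, k_max, decoder_probs):
    """Apply the update once per epoch and record K after each update."""
    k = k_start
    history = []
    for prob in decoder_probs:
        k = update_num_contexts(k, prob, k_max)
        history.append(k)
    return history
```

For example, `curriculum_schedule(2, 64, [0.9, 0.5, 0.9])` grows K on the first and third epochs only, because the decoder accuracy in the second epoch is below the assumed threshold.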

Experimental results

Research questions

RQ1: How can variational inference principles be applied to discover diverse options without task-specific rewards?
RQ2: What is the relationship between variational option discovery methods and variational autoencoders, and how can this guide new algorithms?
RQ3: Does a curriculum that expands context complexity stabilize training and enable learning hundreds of modes?
RQ4: How do VALOR, VIC, and DIAYN compare in terms of diversity, learning speed, and qualitative behavior across different robotics environments?
RQ5: Are learned options useful for downstream hierarchical control tasks?

Key findings

VALOR encodes contexts into trajectories and decodes contexts from trajectories, promoting diverse, trajectory-centered behaviors.
A curriculum that progressively increases the number of contexts improves training stability and speed across VALOR, VIC, and DIAYN.
All three methods learn multiple locomotion modes with similar overall performance; VALOR yields qualitatively different behaviors due to its trajectory-centric decoding.
DIAYN tends to learn faster due to its denser reward signal, while VALOR emphasizes dynamical modes like circular movements.
The curriculum achieves faster mastery for larger context sets (e.g., up to 64 contexts) and yields more robust results across seeds.
Hand environments produce naturalistic finger behaviors, while high-dimensional humanoid environments (Toddler) prove more challenging, highlighting limits of purely information-theoretic objectives.
Pretrained VALOR policies can serve as useful lower-level policies in hierarchical downstream tasks, performing comparably to policies trained from scratch or non-hierarchically.
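
The finding about reward density can be made concrete with a toy comparison. The sketch below is illustrative only: the function names are hypothetical, the log-probabilities are placeholder inputs rather than outputs of a trained decoder, and the DIAYN-style reward is simplified to the decoder log-probability alone. It shows where in an episode the learning signal arrives under a per-state decoder versus a whole-trajectory decoder.

```python
def dense_rewards(step_log_probs):
    """DIAYN-style signal (simplified): one decoder-based reward per step.

    step_log_probs -- log-probability of the true context given each state.
    """
    return list(step_log_probs)


def sparse_rewards(trajectory_log_prob, horizon):
    """VALOR-style signal: a single reward at the end of the episode.

    trajectory_log_prob -- log-probability of the true context given the
                           complete trajectory.
    horizon             -- number of steps in the episode.
    """
    return [0.0] * (horizon - 1) + [trajectory_log_prob]


def num_informative_steps(rewards):
    """Count the steps that carry a nonzero reward."""
    return sum(1 for r in rewards if r != 0.0)
```

With a four-step episode, `dense_rewards([-1.2, -0.9, -0.4, -0.1])` gives feedback at every step, while `sparse_rewards(-0.1, 4)` gives feedback only at the last one, which is one plausible reading of why the per-state method is reported to learn faster.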

This review was created by AI and reviewed by human editors.