QUICK REVIEW

[Paper Review] Evolved Policy Gradients

Rein Houthooft, Richard Y. Chen|arXiv (Cornell University)|Feb 13, 2018

Reinforcement Learning in Robotics|48 references|96 citations

TL;DR

EPG learns a differentiable, temporally-structured loss via evolution strategies to train RL agents, enabling faster learning and generalization to new tasks without reward signals at test time.

ABSTRACT

We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm (EPG) achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method. We also demonstrate that EPG's learned loss can generalize to out-of-distribution test time tasks, and exhibits qualitatively different behavior from other popular metalearning algorithms.

Motivation & Objective

Introduce a metalearning framework that learns a differentiable loss for RL agents.
Use evolution strategies to optimize the loss parameters so that inner-loop learning yields high final returns.
Design a loss architecture that leverages agent history via temporal convolutions.
Demonstrate faster learning and out-of-distribution generalization across randomized continuous control tasks.
Show that learned losses can outperform standard policy gradient baselines on the target task distribution.

Proposed method

Formulate a two-loop metalearning process where an outer loop evolves a loss function Lφ.
Represent Lφ with temporal convolutions over recent agent experience to capture history.
Optimize the inner-loop policy πθ by SGD against Lφ.
Use evolution strategies to optimize φ because final returns are not explicit functions of φ.
Give the loss access to agent history through a memory unit and an experience buffer, with temporal convolutions summarizing the buffer into a context vector.
Bootstrap learning by mixing in a reward-based surrogate loss Lpg whose weight is annealed to 0, so that Lφ eventually drives training on its own.
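The two-loop structure above can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration, not the paper's setup: the "policy" is just a parameter vector theta, the learned loss is the toy quadratic L_phi(theta) = 0.5 * ||theta - phi||^2, the return is the negative squared distance to a hidden goal, and the outer loop uses a basic antithetic evolution-strategies estimator. The annealed Lpg term is left out to keep the example short.

```python
import numpy as np

def inner_loop(phi, theta0, steps=10, lr=0.5):
    """Inner loop: gradient descent on the learned loss only (no reward).

    Toy loss (assumption): L_phi(theta) = 0.5 * ||theta - phi||^2,
    whose gradient with respect to theta is (theta - phi).
    """
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * (theta - phi)
    return theta

def final_return(theta, goal):
    """Toy return: higher when theta ends up closer to the hidden goal."""
    return -float(np.sum((theta - goal) ** 2))

def evolve_loss(goal, dim=2, epochs=200, pop=20, sigma=0.1, outer_lr=0.05, seed=0):
    """Outer loop: evolution strategies on the loss parameters phi.

    The return is only observed after running the inner loop, so phi is
    updated from perturbed evaluations rather than by backpropagation.
    """
    rng = np.random.default_rng(seed)
    phi = np.zeros(dim)
    theta0 = np.zeros(dim)
    for _ in range(epochs):
        eps = rng.standard_normal((pop, dim))
        grad = np.zeros(dim)
        for e in eps:
            # Antithetic pair: evaluate phi + sigma*e and phi - sigma*e.
            f_plus = final_return(inner_loop(phi + sigma * e, theta0), goal)
            f_minus = final_return(inner_loop(phi - sigma * e, theta0), goal)
            grad += (f_plus - f_minus) * e
        grad /= 2 * pop * sigma
        phi += outer_lr * grad  # ascend the estimated gradient of the return
    return phi

goal = np.array([1.0, -2.0])          # hidden from the inner loop
theta0 = np.zeros(2)
initial_return = final_return(inner_loop(np.zeros(2), theta0), goal)
phi = evolve_loss(goal)
trained_return = final_return(inner_loop(phi, theta0), goal)
print(initial_return, trained_return)
```

The inner loop never sees the goal or the return; only the outer loop does, through the fitness of each perturbed phi. That mirrors the claim that a trained loss can be used without reward signals at test time, although a real loss would condition on the agent's experience rather than being a fixed quadratic.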

Experimental results

Research questions

RQ1: Can a learned, differentiable loss surrogate improve sample efficiency and final performance of RL agents on a distribution of tasks?
RQ2: Does evolving the loss function via ES yield policies that generalize to unseen or out-of-distribution tasks?
RQ3: How does the EPG loss leverage agent history to enable fast adaptation and exploration without relying on test-time rewards?
RQ4: What is the relationship between gradients produced by the learned loss and traditional policy-gradient objectives?

Key findings

EPG trains agents faster than an off-the-shelf policy gradient method on several randomized continuous control tasks.
The learned loss Lφ can generalize to out-of-distribution test-time tasks, exhibiting qualitatively different behavior from other metalearning methods.
Including a memory mechanism and temporal convolutions enables the loss to utilize agent history for better guidance during inner-loop updates.
Test-time training with the learned loss does not require reward signals, yet can achieve high final performance within the training task distribution.
Evolving the policy initialization together with the loss (EPG+I) can yield different, sometimes advantageous, learning dynamics compared to standard baselines.
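To make the role of temporal convolutions concrete, here is a minimal NumPy sketch of a loss that slides kernels along the time axis of an experience buffer, pools the result into a context vector, and maps it to a scalar. The layer sizes, the ReLU and mean-pooling choices, the buffer layout, and the names `temporal_conv`, `learned_loss`, `kernels`, and `head` are all hypothetical; the paper's actual architecture is not reproduced here.

```python
import numpy as np

def temporal_conv(x, kernels):
    """Valid 1D convolution along the time axis.

    x:       (T, F) buffer of per-timestep features (hypothetical layout)
    kernels: (K, F, C) filters spanning K consecutive timesteps
    returns: (T - K + 1, C) feature map
    """
    T, _ = x.shape
    K, _, C = kernels.shape
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        window = x[t:t + K]  # (K, F) slice of recent experience
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out

def learned_loss(buffer, phi):
    """Map a buffer of experience to a scalar loss value."""
    hidden = np.maximum(temporal_conv(buffer, phi["kernels"]), 0.0)  # ReLU
    context = hidden.mean(axis=0)  # (C,) summary of the history
    return float(context @ phi["head"])

rng = np.random.default_rng(0)
buffer = rng.standard_normal((16, 5))  # 16 timesteps, 5 features each
phi = {
    "kernels": 0.1 * rng.standard_normal((3, 5, 4)),
    "head": rng.standard_normal(4),
}
loss_value = learned_loss(buffer, phi)

# Hand-checkable case: all-ones input and kernel of width 2 sums to 2.0.
tiny = learned_loss(np.ones((4, 1)), {"kernels": np.ones((2, 1, 1)), "head": np.array([1.0])})
```

In a real setup the buffer entries would depend on the policy's outputs and the loss would be written in an autodiff framework, so that its gradient with respect to the policy parameters can drive the inner-loop updates. This sketch only shows the forward computation.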

This review was created by AI and reviewed by human editors.