QUICK REVIEW

[Paper Review] Trust Region Policy Optimization

John Schulman, Sergey Levine | arXiv (Cornell University) | Feb 19, 2015

Reinforcement Learning in Robotics | 32 references | 3,125 citations

TL;DR

TRPO is a practical policy optimization algorithm derived from a procedure with guaranteed monotonic improvement. It constrains each policy update to a trust region defined by KL divergence, which lets it scale to large nonlinear policies such as neural networks. It performs well on simulated locomotion tasks and on Atari games played from raw pixels.

ABSTRACT

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

Motivation & Objective

Make policy optimization stable by seeking updates with guaranteed monotonic improvement.
Develop a practical algorithm (TRPO) from a theoretical surrogate objective with a KL-based trust region.
Enable learning for large, high-dimensional policy parameterizations (e.g., neural nets) in simulation and vision tasks.
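
The guarantee and the practical problem named in these objectives can be written out as a short math sketch. The notation follows common usage for this paper: \eta is the expected discounted return, \rho_\pi the discounted state visitation frequencies, A_\pi the advantage function, C a constant that depends on the discount factor and the largest advantage magnitude, and \delta the trust-region size. Read it as a summary under those assumptions, not as a full derivation.

```latex
\begin{align}
% Local (surrogate) approximation to the return of a new policy \tilde{\pi},
% built from quantities measured under the current policy \pi
L_{\pi}(\tilde{\pi}) &= \eta(\pi)
  + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a) \\
% The KL-penalized surrogate is a lower bound on the true return
\eta(\tilde{\pi}) &\ge L_{\pi}(\tilde{\pi})
  - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) \\
% Practical trust-region problem: average KL constraint instead of a penalty
\max_{\theta}\; L_{\theta_{\mathrm{old}}}(\theta)
  \quad &\text{subject to} \quad
  \bar{D}_{\mathrm{KL}}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \le \delta
\end{align}
```

Because the right-hand side of the second line equals \eta(\pi) when \tilde{\pi} = \pi, any policy that maximizes it cannot have a lower return than \pi. That is the source of the monotonic-improvement guarantee. The third line is the approximation that makes larger steps practical.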

Proposed method

Derive a surrogate objective L_pi_old(pi) that, once penalized by the maximum KL divergence between the old and new policies, lower-bounds the true expected return.
Propose a trust-region update by solving a constrained optimization that maximizes L_pi_old subject to an average KL divergence bound.
Introduce single-path and vine sampling schemes for estimating the surrogate objective and KL constraint from finite samples.
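Compute the Fisher information matrix analytically, as the Hessian of the KL divergence, and use Fisher-vector products to obtain the update direction efficiently instead of estimating the matrix from the covariance of gradients.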
Adopt a practical optimization loop with conjugate gradient and line search to update policy parameters.
Relate TRPO to natural policy gradient and other prior methods while using a fixed KL-based constraint instead of a penalty.
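
The update loop described above (conjugate gradient for the search direction, then a line search against the KL constraint) can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the function names, the explicit 2x2 matrix standing in for the Fisher information matrix, the quadratic stand-in for the KL divergence, and the toy surrogate are all assumptions chosen so the example runs with the standard library alone. A real implementation would compute Fisher-vector products and the surrogate from sampled trajectories.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x = 0 initially
    p = list(g)  # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def trust_region_step(theta, grad, fvp, surrogate, kl, delta=0.01, backtracks=10):
    """One constrained update: natural-gradient direction plus backtracking."""
    direction = conjugate_gradient(fvp, grad)
    curvature = dot(direction, fvp(direction))
    if curvature <= 0.0:
        return list(theta)
    # Largest step for which the quadratic KL model 0.5 * s^T F s equals delta.
    beta = math.sqrt(2.0 * delta / curvature)
    baseline = surrogate(theta)
    for k in range(backtracks):
        frac = 0.5 ** k
        cand = [t + frac * beta * d for t, d in zip(theta, direction)]
        # Accept only if the constraint holds and the surrogate improves.
        if kl(theta, cand) <= delta and surrogate(cand) > baseline:
            return cand
    return list(theta)  # no acceptable step found: keep the old parameters


# --- Toy problem (all quantities hypothetical) ---
F = [[2.0, 0.0], [0.0, 0.5]]  # stand-in for the Fisher information matrix


def toy_fvp(v):
    return [dot(row, v) for row in F]


def toy_kl(old, new):
    d = [n - o for n, o in zip(new, old)]
    return 0.5 * dot(d, toy_fvp(d))  # quadratic stand-in for the average KL


def toy_surrogate(theta):
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


theta_old = [0.0, 0.0]
toy_grad = [2.0, -4.0]  # gradient of toy_surrogate at theta_old
theta_new = trust_region_step(theta_old, toy_grad, toy_fvp, toy_surrogate, toy_kl)
```

The conjugate gradient routine never forms or inverts the matrix; it only needs matrix-vector products, which is what keeps this kind of update affordable for policies with many parameters. The line search is the safeguard: the quadratic model only approximates the KL divergence, so each candidate step is checked against the actual constraint and the actual surrogate before it is accepted.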

Experimental results

Research questions

RQ1: Can a surrogate objective with a KL-based trust region guarantee monotonic improvement in policy performance for general stochastic policies?
RQ2: How can we reliably estimate the surrogate objective and KL constraint from finite samples for high-dimensional policies?
RQ3: Do single-path and vine sampling schemes provide effective trade-offs between bias, variance, and computational cost in practice?
RQ4: Does enforcing a KL constraint enable larger, more robust policy updates compared to fixed-penalty approaches across diverse tasks?
RQ5: Can TRPO scale to complex, high-dimensional problems such as locomotion with neural policies and Atari games from image inputs?

Key findings

TRPO tends to give monotonic policy improvement in practice across diverse tasks, with little hyperparameter tuning, even though its approximations depart from the theory.
Both single-path and vine TRPO variants solve challenging locomotion tasks (swimmer, hopper, walker) and perform well on Atari games from pixels.
Constrained KL-based updates are more robust and often outperform fixed-penalty natural gradient approaches in large problems.
The gradient-free methods CEM and CMA underperform on tasks with many parameters, because their sample complexity scales poorly with the number of parameters.
TRPO, using average KL constraints, yields competitive results on Atari with a convolutional network and demonstrates scalable learning with tens of thousands of parameters.


This review was created by AI and reviewed by human editors.