QUICK REVIEW

[Paper Review] Projection-Based Constrained Policy Optimization

Tsung-Yen Yang, Justinian Rosca|arXiv (Cornell University)|Oct 7, 2020

Reinforcement Learning in Robotics|19 references|81 citations

TL;DR

PCPO is a two-step iterative RL algorithm that first improves reward within a trust region and then projects the policy onto the constraint set to ensure safety or other costs are satisfied, with theoretical guarantees on reward and constraint bounds.

ABSTRACT

We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $L^2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.

Motivation & Objective

Motivate learning control policies that maximize reward under predefined safety, fairness, or cost constraints in CMDPs.
Develop a two-step policy update combining reward improvement with constraint projection to maintain feasibility.
Provide theoretical bounds on reward improvement and constraint violation per update.
Offer practical algorithms with convergence guarantees and empirical validation on control tasks.
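
For orientation, the constrained objective described above can be written compactly. The notation here is an assumption made for this review: $J^{R}(\pi)$ and $J^{C}(\pi)$ stand for the expected discounted reward and cost of policy $\pi$, and $h$ for the constraint limit.

```latex
\max_{\pi} \; J^{R}(\pi)
\quad \text{s.t.} \quad
J^{C}(\pi) \le h
```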

Proposed method

Two-step update: (1) Reward improvement via TRPO-like step within a KL-divergence trust region.
(2) Projection step that minimizes distance to the intermediate policy while enforcing the constraint via a projected update.
Projection can use either KL divergence in policy space or L2 norm in parameter space.
Theoretical bounds: lower bound on reward improvement and upper bound on constraint violation per update (Theorems 3.1 and 3.2).
Analysis tied to Fisher information (H) and gradient vectors for reward (g) and cost (a), with update rule derived in Equation (6).
Implementation uses conjugate gradient to handle H inversion in high-dimensional policy spaces.
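
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the linearized form of the update, in which the reward step is sqrt(2*delta / (g^T H^-1 g)) * H^-1 g and the projection subtracts a non-negative multiple of L^-1 a, with L = H for the KL projection and L = I for the L2 projection. The function names `conjugate_gradient` and `pcpo_step`, the `hvp` callback (a function returning H @ v), the scalar `b` (current constraint value minus its limit), and the default `delta` are all hypothetical choices made for this sketch.

```python
import numpy as np


def conjugate_gradient(hvp, rhs, iters=50, tol=1e-10):
    """Approximately solve H x = rhs using only products hvp(v) = H @ v."""
    x = np.zeros_like(rhs)
    r = rhs.copy()            # residual rhs - H x, with x starting at zero
    p = r.copy()              # current search direction
    rs_old = r @ r            # squared residual norm
    for _ in range(iters):
        if rs_old < tol:      # stop once the squared residual is tiny
            break
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def pcpo_step(theta, g, a, b, hvp, delta=0.01, projection="kl"):
    """One linearized two-step update (sketch under the assumptions above).

    g: reward gradient, a: cost gradient,
    b: current constraint value minus its limit (positive means violated),
    hvp: callable returning the Fisher-vector product H @ v.
    """
    # Step 1: trust-region reward improvement along H^-1 g.
    Hinv_g = conjugate_gradient(hvp, g)
    step_size = np.sqrt(2.0 * delta / (g @ Hinv_g))

    # Step 2: projection direction L^-1 a (L = H for KL, L = I for L2).
    if projection == "kl":
        Linv_a = conjugate_gradient(hvp, a)
    elif projection == "l2":
        Linv_a = a.copy()
    else:
        raise ValueError("projection must be 'kl' or 'l2'")

    # Linearized constraint value after the reward step; project only if > 0.
    violation = step_size * (a @ Hinv_g) + b
    coef = max(0.0, violation / (a @ Linv_a))
    return theta + step_size * Hinv_g - coef * Linv_a
```

With a small symmetric positive-definite matrix standing in for H, for example `hvp = lambda v: H @ v`, calling `pcpo_step` with `projection="kl"` and `projection="l2"` generally returns different parameter vectors, while both satisfy the linearized constraint `a @ (theta_new - theta) + b <= 0` up to solver tolerance.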

Experimental results

Research questions

RQ1: How to reliably maximize reward while satisfying CMDP constraints during learning?
RQ2: What are the per-update theoretical bounds on reward improvement and constraint violation for PCPO?
RQ3: How do KL-divergence and L2-norm projections compare in terms of convergence and feasibility?
RQ4: How does PCPO perform empirically against state-of-the-art constrained RL methods on safety and fairness tasks?

Key findings

On the tested control tasks, PCPO averages more than 3.5 times less constraint violation and about 15% higher reward than state-of-the-art methods.
Two-stage update (reward improvement then projection) maintains feasibility without line search or hyperparameter tuning for constraints.
KL projection and L2 projection converge to different stationary points, with trade-offs in reward stability and constraint satisfaction.
PCPO consistently learns constraint-satisfying policies across all tasks, outperforming CPO and PDO in constraint handling.
A larger current constraint violation $b^+$ weakens the worst-case performance bounds, highlighting the importance of the projection step.

This review was created by AI and reviewed by human editors.