QUICK REVIEW

[Paper Review] Truly Proximal Policy Optimization

Yuhui Wang, Hao He | arXiv (Cornell University) | Mar 19, 2019

Reinforcement Learning in Robotics | 25 references | 32 citations

TL;DR

This paper analyzes PPO’s proximal properties, showing that it neither strictly bounds the likelihood ratio nor enforces a true trust region. It then presents Truly PPO, which combines a rollback operation with trust-region-based clipping to guarantee monotonic improvement and improve sample efficiency.

ABSTRACT

Proximal policy optimization (PPO) is one of the most successful deep reinforcement-learning methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, its optimization behavior is still far from being fully understood. In this paper, we show that PPO could neither strictly restrict the likelihood ratio as it attempts to do nor enforce a well-defined trust region constraint, which means that it may still suffer from the risk of performance instability. To address this issue, we present an enhanced PPO method, named Truly PPO. Two critical improvements are made in our method: 1) it adopts a new clipping function to support a rollback behavior to restrict the difference between the new policy and the old one; 2) the triggering condition for clipping is replaced with a trust region-based one, such that optimizing the resulted surrogate objective function provides guaranteed monotonic improvement of the ultimate policy performance. It seems, by adhering more truly to making the algorithm proximal - confining the policy within the trust region, the new algorithm improves the original PPO on both sample efficiency and performance.

Motivation & Objective

Assess whether PPO strictly bounds the likelihood ratio and enforces a trust region constraint.
Investigate the proximal properties of PPO and identify gaps between clipping and trust region theory.
Propose enhancements to PPO that ensure true proximal behavior and monotonic policy improvement.
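
The first question can be made concrete with a small numerical sketch. It assumes the standard PPO clipped surrogate, min(r * A, clip(r, 1 - eps, 1 + eps) * A), for a single sample. The helper names, the finite-difference check, and eps = 0.2 are illustrative choices, not taken from the paper. The sketch suggests that once the ratio leaves the clipping range, the surrogate goes flat: the incentive to move further disappears, but nothing in the objective pulls the ratio back.

```python
import numpy as np


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (eps is illustrative)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


def slope(f, ratio, advantage, h=1e-6):
    """Central finite-difference derivative of f with respect to the ratio."""
    return (f(ratio + h, advantage) - f(ratio - h, advantage)) / (2.0 * h)


# Positive advantage: the objective rewards increasing the ratio.
slope_inside = slope(ppo_clip_objective, 1.1, 1.0)   # ratio inside [0.8, 1.2]
slope_outside = slope(ppo_clip_objective, 1.5, 1.0)  # ratio above 1 + eps

print("slope inside the clipping range:", slope_inside)
print("slope outside the clipping range:", slope_outside)
```

A zero slope only removes this sample's own incentive. Because all samples share the same policy parameters, updates driven by other samples can plausibly still move this ratio further out of range, which is one way to read the gap between clipping and a true bound.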

Proposed method

Introduce a rollback operation to counteract incentives pushing the policy outside the clipping range.
Replace the ratio-based clipping trigger with a trust-region-based condition to bound the KL divergence.
Combine the rollback mechanism with trust-region-based clipping to form Truly PPO with first-order optimization.
Define a new objective that subtracts a KL-based penalty when out of the trust region to promote monotonic improvement.
Provide theoretical guarantees of monotonic improvement for Truly PPO.
Empirically evaluate on benchmark tasks to compare policy performance and sample efficiency.
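
The two mechanisms listed above can be sketched per sample as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear form of the rollback branch, the extra condition `surrogate >= advantage`, and the values of eps, alpha, and delta are all hypothetical choices made to match the description in this review. In a real training loop the KL term would depend on the policy parameters and be differentiated; here it is passed in as a plain number.

```python
import numpy as np


def rollback_objective(ratio, advantage, eps=0.2, alpha=0.3):
    """Clipped surrogate whose out-of-range branch slopes back (assumed form).

    Inside [1 - eps, 1 + eps] this equals ratio * advantage. Outside, the
    flat clipped value is replaced by a line of slope -alpha that is
    continuous at the range boundary, so moving further out lowers the
    objective instead of leaving it unchanged.
    """
    if ratio <= 1.0 - eps:
        f = -alpha * ratio + (1.0 + alpha) * (1.0 - eps)
    elif ratio >= 1.0 + eps:
        f = -alpha * ratio + (1.0 + alpha) * (1.0 + eps)
    else:
        f = ratio
    return min(ratio * advantage, f * advantage)


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two discrete distributions with full support."""
    p_old = np.asarray(p_old, dtype=float)
    p_new = np.asarray(p_new, dtype=float)
    return float(np.sum(p_old * np.log(p_old / p_new)))


def kl_triggered_objective(ratio, advantage, kl, delta=0.03, alpha=5.0):
    """Surrogate with a KL penalty once the trust region is left (assumed form).

    The penalty applies only when kl >= delta and the surrogate is at least
    its value at the old policy (ratio = 1); otherwise a constant is
    subtracted, which does not change the gradient.
    """
    surrogate = ratio * advantage
    if kl >= delta and surrogate >= advantage:
        return surrogate - alpha * kl
    return surrogate - delta


# The ratio at sampled action 0 is exactly 1, yet the distributions differ
# strongly on the other actions, so a ratio-based trigger would not fire.
p_old = [0.5, 0.49, 0.01]
p_new = [0.5, 0.01, 0.49]
ratio_a0 = p_new[0] / p_old[0]
kl = kl_divergence(p_old, p_new)

print("ratio at sampled action:", ratio_a0)
print("KL divergence:", kl)
print("rollback objective at ratio 1.5:", rollback_objective(1.5, 1.0))
print("KL-triggered objective:", kl_triggered_objective(1.5, 1.0, kl))
```

The toy distributions are chosen only to show why a KL-based trigger can react where a ratio-based one stays silent; they say nothing about how often this happens in practice.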

Experimental results

Research questions

RQ1: Does PPO strictly bound the likelihood ratio within its clipping range?
RQ2: Can PPO enforce a well-defined trust region constraint like TRPO?
RQ3: Can we design a PPO variant that achieves true proximal behavior and monotonic improvement while remaining simple to optimize?
RQ4: What benefits do rollback and trust region–based clipping bring to sample efficiency and performance?
RQ5: How does Truly PPO compare to TRPO and PPO in theory and practice?

Key findings

PPO does not strictly bound the likelihood ratio within the clipping range in practice.
PPO does not enforce a true trust region constraint: clipping the likelihood ratio does not bound the KL divergence between the old and new policies.
Introducing a rollback operation and a trust region–based clipping mechanism yields Truly PPO with monotonic improvement guarantees.
The Truly PPO objective penalizes KL divergence when out of the trust region, promoting proximal updates.
The combination improves policy performance and sample efficiency on benchmark tasks.
Code for implementation is provided by the authors.

This review was created by AI and reviewed by human editors.