QUICK REVIEW

[Paper Review] Phasic Policy Gradient

Karl Cobbe, Jacob Hilton | arXiv (Cornell University) | Sep 9, 2020

Reinforcement Learning in Robotics | 20 references | 49 citations

TL;DR

Phasic Policy Gradient (PPG) decouples policy and value function training into two alternating phases, sharing representations while reducing interference, leading to improved sample efficiency over PPO on Procgen benchmarks. It also introduces a flexible auxiliary phase for distilling value-function information into the policy network.

ABSTRACT

We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be more aggressively optimized with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.

Motivation & Objective

Improve sample efficiency in on-policy actor-critic methods by reducing interference between the policy and value function objectives.
Propose a two-phase training scheme that preserves shared representations while decoupling optimization.
Introduce an auxiliary distillation phase to transfer value-function knowledge into the policy network.
Demonstrate that decoupled training with PPG yields better sample efficiency than PPO on Procgen environments.

Proposed method

Use disjoint policy and value function networks to reduce objective interference.
The policy phase optimizes a PPO-style clipped surrogate objective with entropy regularization.
The auxiliary phase distills features by jointly optimizing an auxiliary value head on the policy network and a behavioral cloning objective that keeps the policy close to its pre-phase outputs, with value targets held fixed.
The auxiliary loss L^{aux} uses the value function error as a training signal to improve the representations available to the policy.
The joint loss L^{joint} combines L^{aux} with a behavioral cloning term that prevents policy drift, weighted by the clone coefficient β_{clone}.
Hyperparameters: N_{π} (policy-phase iterations between auxiliary phases), E_{π} and E_{V} (policy and value epochs per policy-phase iteration), E_{aux} (epochs per auxiliary phase), and β_{clone} (cloning weight). Value targets stay fixed throughout the auxiliary phase.
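
The two auxiliary-phase losses described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes discrete actions, a squared-error auxiliary value loss, and a KL divergence from the pre-phase policy as the cloning term. The function names (aux_value_loss, categorical_kl, joint_loss) and the default beta_clone=1.0 are hypothetical choices made for this example.

```python
import math


def aux_value_loss(aux_values, value_targets):
    """L_aux: half the mean squared error of the auxiliary value head
    against value targets that stay fixed during the auxiliary phase."""
    n = len(aux_values)
    return 0.5 * sum((v - t) ** 2 for v, t in zip(aux_values, value_targets)) / n


def categorical_kl(p_old, p_new, eps=1e-12):
    """KL(p_old || p_new) for two discrete action distributions.
    eps guards against log(0); terms with p_old == 0 contribute nothing."""
    return sum(
        po * (math.log(po + eps) - math.log(pn + eps))
        for po, pn in zip(p_old, p_new)
        if po > 0.0
    )


def joint_loss(aux_values, value_targets, old_probs, new_probs, beta_clone=1.0):
    """L_joint = L_aux + beta_clone * mean_t KL(pi_old(.|s_t) || pi_new(.|s_t)).
    old_probs holds the policy outputs recorded before the auxiliary phase."""
    l_aux = aux_value_loss(aux_values, value_targets)
    mean_kl = sum(
        categorical_kl(po, pn) for po, pn in zip(old_probs, new_probs)
    ) / len(old_probs)
    return l_aux + beta_clone * mean_kl
```

When the updated policy still matches the recorded one, the cloning term vanishes and joint_loss reduces to the auxiliary value loss. Any drift in the action distribution adds a penalty scaled by beta_clone.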

Experimental results

Research questions

RQ1: Does decoupling policy and value function optimization reduce interference and improve sample efficiency in on-policy RL?
RQ2: How does independent optimization of policy and value functions interact with shared representations in neural networks?
RQ3: What is the impact of auxiliary phase frequency and sample reuse on learning efficiency and stability?
RQ4: Can a single-network variant with gradient detachment approximate the performance of dual-net PPG architectures?

Key findings

PPG achieves significantly better sample efficiency than PPO on Procgen benchmarks.
Policy sample reuse benefits in PPG are limited when training is decoupled; a single policy epoch is often near-optimal.
Auxiliary phase with more epochs generally helps up to a point, improving representation learning and value estimation.
Frequent auxiliary phases hurt policy optimization due to interference; infrequent auxiliary phases are preferable.
KL penalty and clipping objectives in PPG yield comparable performance under the studied settings.
A single-network PPG variant with gradient detachment closely matches dual-network performance, reducing memory costs.
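
The findings on sample reuse and auxiliary phase frequency concern how the two phases alternate. A rough sketch of that schedule follows, assuming each policy-phase iteration collects one rollout and then runs separate policy and value epochs. The generator name ppg_schedule, the step labels, and the default values are illustrative assumptions, not settings taken from this review.

```python
def ppg_schedule(num_cycles, n_pi=32, e_pi=1, e_v=1, e_aux=6):
    """Yield (step, cycle, index) tuples in the order the phases would run.

    n_pi  : policy-phase iterations between auxiliary phases
    e_pi  : policy epochs per policy-phase iteration
    e_v   : value epochs per policy-phase iteration
    e_aux : epochs per auxiliary phase
    """
    for cycle in range(num_cycles):
        # Policy phase: n_pi iterations of fresh rollouts plus updates.
        for it in range(n_pi):
            yield ("rollout", cycle, it)
            for _ in range(e_pi):
                yield ("policy_epoch", cycle, it)
            for _ in range(e_v):
                yield ("value_epoch", cycle, it)
        # Auxiliary phase: reuse the data gathered during the whole
        # policy phase for e_aux epochs of the joint and value losses.
        for epoch in range(e_aux):
            yield ("aux_epoch", cycle, epoch)


if __name__ == "__main__":
    steps = list(ppg_schedule(num_cycles=1, n_pi=4))
    print(sum(1 for s in steps if s[0] == "aux_epoch"), "aux epochs after",
          sum(1 for s in steps if s[0] == "rollout"), "rollouts")
```

Raising n_pi makes auxiliary phases less frequent, while e_pi, e_v, and e_aux set how many times each batch of data is reused in the corresponding phase.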

This review was created by AI and reviewed by human editors.