QUICK REVIEW

[Paper Review] Phasic Policy Gradient

Karl Cobbe, Jacob Hilton|arXiv (Cornell University)|Sep 9, 2020

Reinforcement Learning in Robotics | References: 20 | Citations: 49

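One-Line Summary

Phasic Policy Gradient (PPG) separates policy and value function training into two alternating phases, reducing interference between the two objectives while still sharing representations. This improves sample efficiency over PPO on the Procgen Benchmark. PPG also introduces a flexible auxiliary phase that distills value function information into the policy network.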

ABSTRACT

We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Using separate networks avoids interference between objectives, while using a shared network allows useful features to be shared. PPG is able to achieve the best of both worlds by splitting optimization into two phases, one that advances training and one that distills features. PPG also enables the value function to be more aggressively optimized with a higher level of sample reuse. Compared to PPO, we find that PPG significantly improves sample efficiency on the challenging Procgen Benchmark.

Motivation and Objectives

Improve sample efficiency in on-policy actor-critic methods by reducing interference between the policy and value function objectives.
Propose a two-phase training scheme that preserves shared representations while decoupling optimization.
Introduce an auxiliary distillation phase to transfer value-function knowledge into the policy network.
Demonstrate that decoupled training with PPG yields better sample efficiency than PPO on Procgen environments.

Proposed Method

Use disjoint policy and value function networks to reduce objective interference.
The policy phase optimizes a PPO-style clipped surrogate objective with entropy regularization.
The auxiliary phase distills features by jointly optimizing an auxiliary value head on the policy network and a cloning objective that keeps the policy aligned with its pre-phase behavior, with value targets held fixed.
The auxiliary loss L^{aux} uses the value function error as a training signal to improve the representations available to the policy.
L^{joint} combines the auxiliary loss with a behavioral cloning term that prevents policy drift; the clone coefficient β_{clone} sets the weight of that term.
Hyperparameters: N_{π} (policy iterations per policy phase), E_{π} (policy epochs), E_{V} (value function epochs), E_{aux} (auxiliary epochs), and β_{clone} (clone coefficient). The value targets stay fixed throughout the auxiliary phase.
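
The two auxiliary-phase losses described above can be sketched numerically. This is a minimal illustration, not the authors' implementation. It assumes a discrete action space with categorical policies given as logits, a squared-error auxiliary loss with a factor of 1/2, and a behavioral cloning term measured as KL(π_old || π_new). The function names (`aux_loss`, `clone_kl`, `joint_loss`) are hypothetical and do not come from the review.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def aux_loss(aux_values, value_targets):
    """Assumed form of L^aux: half the mean squared error between the
    auxiliary value head's predictions and the fixed value targets."""
    return 0.5 * np.mean((aux_values - value_targets) ** 2)

def clone_kl(old_logits, new_logits):
    """Behavioral cloning term: mean KL(pi_old || pi_new) over the batch.
    pi_old is the policy recorded before the auxiliary phase starts."""
    log_old = log_softmax(old_logits)
    log_new = log_softmax(new_logits)
    return np.mean(np.sum(np.exp(log_old) * (log_old - log_new), axis=-1))

def joint_loss(aux_values, value_targets, old_logits, new_logits, beta_clone=1.0):
    """Assumed form of L^joint: auxiliary loss plus beta_clone times the
    cloning term. beta_clone=1.0 is an illustrative default."""
    return aux_loss(aux_values, value_targets) + beta_clone * clone_kl(old_logits, new_logits)
```

When the policy has not moved during the auxiliary phase, the cloning term is zero and L^{joint} reduces to L^{aux}. A larger β_{clone} penalizes any change in the action distribution more heavily, which is how the coefficient trades representation learning against policy drift.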

Experimental Results

Research Questions

RQ1: Does decoupling policy and value function optimization reduce interference and improve sample efficiency in on-policy RL?
RQ2: How does independent optimization of policy and value functions interact with shared representations in neural networks?
RQ3: What is the impact of auxiliary phase frequency and sample reuse on learning efficiency and stability?
RQ4: Can a single-network variant with gradient detachment approximate the performance of dual-net PPG architectures?

Key Findings

PPG achieves significantly better sample efficiency than PPO on the Procgen Benchmark.
Once training is decoupled, reusing samples for policy updates brings little benefit; a single policy epoch is often near-optimal.
Adding auxiliary epochs generally helps up to a point, improving representation learning and value estimation.
Running the auxiliary phase too often hurts policy optimization through interference; infrequent auxiliary phases are preferable.
KL penalty and clipping objectives in PPG yield comparable performance under the studied settings.
A single-network PPG variant with gradient detachment closely matches dual-network performance, reducing memory costs.
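
The findings on epoch counts and auxiliary phase frequency refer to how one PPG cycle is laid out. The sketch below enumerates the update steps of a single cycle under the assumption that each of the N_{π} policy iterations collects a rollout, runs E_{π} policy epochs and E_{V} value epochs, and that E_{aux} auxiliary epochs follow once the policy phase ends. The function name `ppg_phase_schedule` and the step labels are hypothetical, and the numbers in the usage lines are illustrative rather than taken from this review.

```python
from collections import Counter

def ppg_phase_schedule(n_pi, e_pi, e_v, e_aux):
    """Return the ordered list of (step_name, index) pairs for one PPG cycle.

    Policy phase: n_pi iterations, each with one rollout, e_pi policy epochs
    and e_v value epochs. Auxiliary phase: e_aux epochs over the data
    gathered during the whole policy phase.
    """
    steps = []
    for it in range(n_pi):
        steps.append(("rollout", it))
        steps.extend(("policy_epoch", it) for _ in range(e_pi))
        steps.extend(("value_epoch", it) for _ in range(e_v))
    steps.extend(("aux_epoch", k) for k in range(e_aux))
    return steps

# Illustrative settings: infrequent auxiliary phases, one policy epoch.
schedule = ppg_phase_schedule(n_pi=32, e_pi=1, e_v=1, e_aux=6)
counts = Counter(name for name, _ in schedule)
print(counts)
```

Raising `n_pi` makes auxiliary phases rarer, while raising `e_aux` increases sample reuse for the value and distillation objectives without touching the number of policy epochs.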

This review was generated by AI and checked by a human editor.