[Paper Review] Off-Policy Policy Gradient with State Distribution Correction
The paper introduces OPPOSD, an off-policy policy gradient method that accounts for state distribution mismatch between behavior and target policies, with a convergence guarantee and empirical improvements over baselines.
We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under the learned policy. Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. Empirically, we compare our method in simulations to several strong baselines which do not correct for this mismatch, significantly improving in the quality of the policy discovered.
Motivation & Objective
- Leverage offline data for sequential decision making in MDPs.
- Address mismatch between state distributions under behavior and evaluation policies.
- Develop a practical off-policy policy gradient method with theoretical guarantees.
- Demonstrate empirical gains over baselines that ignore state distribution correction.
Proposed method
- Builds on state-distribution ratio estimation to correct gradients.
- Introduces an augmented MDP M_mu to ensure optimistic yet comparable policy values.
- Derives an off-policy policy gradient estimator incorporating the ratio d^pi(s)/d^mu(s).
- Uses a smoothed behavior policy to ensure coverage and applies RKHS-based learning for density ratio w(s).
- Implements an actor-critic algorithm (OPPOSD) with a critic, density ratio estimator w, and policy gradient updates.
- Provides convergence results showing stationary-point convergence under standard assumptions.
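The role of the ratio d^pi(s)/d^mu(s) in the gradient can be illustrated on a small tabular MDP. The following is a minimal sketch, not the paper's OPPOSD implementation: it assumes a discounted, normalized state visitation distribution, a tabular softmax policy, and exact values for the density ratio w and the critic Q computed from a known model (the paper instead estimates both from data). All function names and the random MDP are hypothetical. It compares the expected gradient under behavior data with and without the state weight w(s) against the on-policy gradient, which by the policy gradient theorem equals the true gradient up to a constant factor 1/(1 - gamma).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def state_distribution(P, policy, d0, gamma):
    """Normalized discounted visitation: (1 - gamma) * sum_t gamma^t Pr(s_t = s)."""
    P_pol = np.einsum("sa,sat->st", policy, P)  # state-to-state kernel under `policy`
    n = P.shape[0]
    return (1 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pol).T, d0)

def q_values(P, R, policy, gamma):
    """Exact Q and V of `policy` (assumption: stands in for a learned critic)."""
    P_pol = np.einsum("sa,sat->st", policy, P)
    r_pol = (policy * R).sum(axis=1)
    V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pol, r_pol)
    Q = R + gamma * P @ V  # (S, A, S) @ (S,) -> (S, A)
    return Q, V

def expected_gradient(d_mu, mu, pi, Q, w):
    """E_{s ~ d_mu, a ~ mu}[ w(s) * rho(s, a) * Q(s, a) * grad log pi(a|s) ]
    for a tabular softmax policy; the result has one row per state."""
    S, A = pi.shape
    grad = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            rho = pi[s, a] / mu[s, a]   # action importance ratio
            score = -pi[s].copy()       # grad of log softmax wrt the logits of state s
            score[a] += 1.0
            grad[s] += d_mu[s] * mu[s, a] * w[s] * rho * Q[s, a] * score
    return grad

# Hypothetical random MDP, used only for illustration.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s']
R = rng.uniform(size=(S, A))
d0 = np.full(S, 1.0 / S)
pi = softmax(rng.normal(size=(S, A)))       # target policy
mu = softmax(rng.normal(size=(S, A)))       # behavior policy

d_pi = state_distribution(P, pi, d0, gamma)
d_mu = state_distribution(P, mu, d0, gamma)
Q, V = q_values(P, R, pi, gamma)

w = d_pi / d_mu                                                  # exact density ratio
g_corrected = expected_gradient(d_mu, mu, pi, Q, w)              # with state correction
g_uncorrected = expected_gradient(d_mu, mu, pi, Q, np.ones(S))   # action ratio only
g_on_policy = d_pi[:, None] * pi * (Q - V[:, None])              # on-policy reference

print("corrected matches on-policy:  ", np.allclose(g_corrected, g_on_policy))
print("uncorrected matches on-policy:", np.allclose(g_uncorrected, g_on_policy))
```

In this sketch the uncorrected estimate differs from the on-policy gradient by the per-state factor d^mu(s)/d^pi(s), so states that the behavior policy over- or under-visits are weighted incorrectly. With estimated rather than exact w and Q, the match would hold only approximately.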
Experimental results
Research questions
- RQ1: Can correcting for the state visitation distribution mismatch improve batch off-policy policy optimization?
- RQ2: Is it possible to estimate policy gradients from off-policy data without exponential variance growth?
- RQ3: Do state distribution corrections yield empirical gains over Off-PAC and other baselines in benchmark domains?
- RQ4: Can the proposed estimator be integrated with a stable actor-critic optimization with convergence guarantees?
Key findings
- OPPOSD achieves higher policy performance than Off-PAC and the behavior policy in CartPole and HIV treatment simulators.
- Correcting for state distribution mismatch significantly improves gradient estimation quality in the presented example where Off-PAC fails.
- The algorithm converges to a stationary point provided density ratio and critic estimates have vanishing estimation error.
- Experiments show that incorporating density ratio corrections and off-policy evaluation can identify good policies during optimization.
- The state distribution correction does not require prohibitive variance increases in gradient estimates.
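The variance point can be made concrete with a short worked calculation. This is a sketch under a simplifying assumption that is not taken from the paper: the per-step action ratios \rho_t = \pi(a_t \mid s_t) / \mu(a_t \mid s_t) are treated as independent with a common second moment 1 + \sigma^2. Under that assumption, a trajectory-level importance weight over T steps has variance that grows exponentially in T, whereas the state-distribution-corrected weight is a product of only two factors:

```latex
\begin{align*}
\mathbb{E}_\mu[\rho_t] &= 1, \qquad \mathbb{E}_\mu[\rho_t^2] = 1 + \sigma^2, \\
\operatorname{Var}_\mu\Big(\prod_{t=0}^{T-1} \rho_t\Big)
  &= \prod_{t=0}^{T-1} \mathbb{E}_\mu[\rho_t^2]
     - \Big(\prod_{t=0}^{T-1} \mathbb{E}_\mu[\rho_t]\Big)^2
   = (1 + \sigma^2)^T - 1, \\
\text{corrected weight:}\quad
  w(s)\,\rho(s,a) &= \frac{d^\pi(s)}{d^\mu(s)} \cdot \frac{\pi(a \mid s)}{\mu(a \mid s)} .
\end{align*}
```

The variance of the corrected weight still depends on how much d^pi and d^mu overlap and on how accurately w is estimated, but it does not compound with the horizon in the way the product of per-step ratios does.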
This review was created by AI and reviewed by human editors.