QUICK REVIEW

[Paper Review] Off-Policy Policy Gradient with State Distribution Correction

Yao Liu, Adith Swaminathan|arXiv (Cornell University)|Apr 17, 2019

Energy, Environment, and Transportation Policies|References: 27|Cited by: 46

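One-Sentence Summary

This paper proposes OPPOSD, an off-policy policy gradient method that accounts for the mismatch between the state distributions of the behavior policy and the target policy. It comes with a convergence guarantee and improves empirically on baseline methods.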

ABSTRACT

We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under the learned policy. Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. Empirically, we compare our method in simulations to several strong baselines which do not correct for this mismatch, significantly improving in the quality of the policy discovered.

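Motivation and Objectives

Enable sequential decision-making from offline (batch) data in Markov decision processes (MDPs).
Address the mismatch between the state distributions under the behavior policy and the evaluation policy.
Develop a practical off-policy policy gradient method with theoretical guarantees.
Demonstrate empirical gains over baselines that ignore state distribution correction.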

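Proposed Method

Builds on state distribution ratio estimation to correct the gradient.
Introduces an augmented MDP M_mu to ensure optimistic and comparable policy values.
Derives an off-policy policy gradient estimator that includes the ratio d^pi(s)/d^mu(s).
Uses a smoothed behavior policy to ensure coverage, and learns the density ratio w(s) with an RKHS-based method.
Implements an actor-critic algorithm (OPPOSD) with a critic, a density ratio estimator w, and policy gradient updates.
Provides a convergence result: convergence to a stationary point under standard assumptions.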
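
As a rough illustration of how the ratio d^pi(s)/d^mu(s) enters the gradient estimate, the sketch below assumes a tabular softmax policy and that estimates of Q^pi (from the critic) and of w(s) = d^pi(s)/d^mu(s) are already available. It averages w(s) * pi(a|s)/mu(a|s) * Q(s,a) * grad log pi(a|s) over samples drawn from the behavior policy. The function names, the array layout, and the per-action importance weight pi(a|s)/mu(a|s) are illustrative assumptions, not the authors' implementation; the RKHS-based learning of w and the critic updates are not shown.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (n_states, n_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)  # subtract max for stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def corrected_policy_gradient(theta, states, actions, q_values, w, mu):
    """Sample average of w(s) * pi(a|s)/mu(a|s) * Q(s,a) * grad log pi(a|s).

    Hypothetical inputs (all assumed given, none learned here):
      states, actions : integer arrays of (s, a) pairs sampled under mu
      q_values        : (n_states, n_actions) critic estimate of Q^pi
      w               : (n_states,) estimate of d^pi(s) / d^mu(s)
      mu              : (n_states, n_actions) behavior policy probabilities
    """
    pi = softmax_policy(theta)
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        rho = pi[s, a] / mu[s, a]      # per-action importance weight
        score = -pi[s].copy()          # grad of log softmax w.r.t. theta[s, :]
        score[a] += 1.0
        grad[s] += w[s] * rho * q_values[s, a] * score
    return grad / len(states)

# Tiny hand-made batch: 2 states, 2 actions, uniform pi and mu.
theta = np.zeros((2, 2))
mu = np.full((2, 2), 0.5)
w = np.array([2.0, 0.5])
q_values = np.array([[1.0, 0.0], [0.0, 1.0]])
grad = corrected_policy_gradient(theta, [0, 1], [0, 1], q_values, w, mu)
print(grad)
```

Setting w to all ones in this sketch gives an estimator that ignores the state distribution mismatch, which is the kind of baseline the summary contrasts OPPOSD with.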

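Experimental Results

Research Questions

RQ1: Can correcting the mismatch in state visitation distributions improve batch off-policy policy optimization?
RQ2: Is it possible to estimate the policy gradient from off-policy data without exponential growth in variance?
RQ3: On benchmark domains, does state distribution correction yield empirical improvements over Off-PAC and other baselines?
RQ4: Can the proposed estimator be combined with stable actor-critic optimization that has convergence guarantees?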

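Key Findings

OPPOSD achieves higher policy performance than Off-PAC and the behavior policy on CartPole and an HIV treatment simulator.
In the illustrative example, correcting the state distribution mismatch markedly improves the quality of the gradient estimate, whereas Off-PAC fails.
The algorithm converges to a stationary point provided the estimation errors of the density ratio and the critic vanish.
Experiments show that combining density ratio correction with off-policy evaluation can identify good policies during optimization.
State distribution correction does not require a large increase in the variance of the gradient estimate.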
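
To make the effect of the correction concrete, the following toy check computes exact (not sampled) gradients on a small tabular MDP. Everything here is a hypothetical setup chosen for illustration and is not the example from the paper: a 3-state chain, a discounted state distribution d(s) = (1 - gamma) * sum_t gamma^t P(s_t = s), a softmax target policy pi, and a fixed behavior policy mu. Weighting states by d^mu gives a different gradient from the on-policy one, while reweighting by the exact ratio w(s) = d^pi(s)/d^mu(s) recovers it.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# Hypothetical 3-state chain: action 0 moves left, action 1 moves right.
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s']
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0
R = np.zeros((n_states, n_actions))
R[n_states - 1, :] = 1.0            # reward only in the rightmost state
p0 = np.array([1.0, 0.0, 0.0])      # always start in the leftmost state

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def state_transitions(policy):
    """State-to-state transition matrix induced by a policy."""
    return np.einsum("sa,sat->st", policy, P)

def discounted_state_dist(policy):
    """Normalized discounted state distribution started from p0."""
    A = np.eye(n_states) - gamma * state_transitions(policy).T
    return (1.0 - gamma) * np.linalg.solve(A, p0)

def q_function(policy):
    """Exact Q^policy via the Bellman equation."""
    r_pi = (policy * R).sum(axis=1)
    v = np.linalg.solve(np.eye(n_states) - gamma * state_transitions(policy), r_pi)
    return R + gamma * (P @ v)

def gradient_under(state_dist, policy, q):
    """sum_s dist(s) sum_a pi(a|s) Q(s,a) grad log pi(a|s) for tabular softmax."""
    advantage = q - (policy * q).sum(axis=1, keepdims=True)
    return state_dist[:, None] * policy * advantage

theta = np.array([[0.0, 1.0]] * n_states)   # target policy prefers moving right
pi = softmax(theta)
mu = np.array([[0.8, 0.2]] * n_states)      # behavior policy prefers moving left

d_pi, d_mu = discounted_state_dist(pi), discounted_state_dist(mu)
w = d_pi / d_mu                              # exact state distribution ratio
q = q_function(pi)

true_grad = gradient_under(d_pi, pi, q)      # on-policy state weighting
uncorrected = gradient_under(d_mu, pi, q)    # states weighted by d^mu only
corrected = gradient_under(w * d_mu, pi, q)  # d^mu reweighted by w

print("max error without correction:", np.abs(uncorrected - true_grad).max())
print("max error with correction:   ", np.abs(corrected - true_grad).max())
```

The importance weight over actions does not appear explicitly because the expectation over actions is taken exactly. In practice w and Q must be estimated, so the corrected gradient is only as accurate as those estimates, which matches the vanishing-error condition in the convergence finding above.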

This review was generated by AI and reviewed by a human editor.