QUICK REVIEW

[Paper Review] Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies

Kaiqing Zhang, Alec Koppel | arXiv (Cornell University) | Jun 19, 2019

Topic: Reinforcement Learning in Robotics | References: 61 | Cited by: 44

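One-Sentence Summary

This paper shows that random-horizon policy gradient (RPG) methods yield unbiased estimates of the infinite-horizon gradient and converge to stationary points, and it introduces a modified RPG with periodically enlarged stepsizes that escapes saddle points and approaches locally optimal policies, as validated by inverted pendulum experiments.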

ABSTRACT

Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of its empirical success, a rigorous understanding of the global convergence of PG methods is lacking in the literature. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient. This method then yields an unbiased estimate of the policy gradient with bounded variance, which enables the tools from nonconvex optimization to be applied to establish global convergence. Employing this perspective, we first recover the convergence results with rates to the stationary-point policies in the literature. More interestingly, motivated by advances in nonconvex optimization, we modify the proposed PG method by introducing periodically enlarged stepsizes. The modified algorithm is shown to escape saddle points under mild assumptions on the reward and the policy parameterization. Under a further strict saddle points assumption, this result establishes convergence to essentially locally-optimal policies of the underlying problem, and thus bridges the gap in existing literature on the convergence of PG methods. Results from experiments on the inverted pendulum are then provided to corroborate our theory, namely, by slightly reshaping the reward function to satisfy our assumption, unfavorable saddle points can be avoided and better limit points can be attained. Intriguingly, this empirical finding justifies the benefit of reward-reshaping from a nonconvex optimization perspective.

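Motivation and Goals

Develop a rigorous understanding of the global convergence of policy gradient methods in infinite-horizon MDPs.
Introduce random-horizon Monte-Carlo rollouts to obtain unbiased gradient estimates.
Connect policy gradient convergence to tools from nonconvex optimization and establish convergence rates to stationary points.
Propose Modified RPG (MRPG), which uses periodically enlarged stepsizes to escape saddle points and converge to essentially locally optimal policies.
Show the benefit of reward reshaping from a nonconvex optimization perspective and verify it experimentally.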

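Proposed Method

Define RPG, which uses random, geometrically distributed rollout horizons to estimate Q-values and the policy gradient without bias.
Provide the EstQ and EstV subroutines, which generate unbiased Q-value and value estimates from finite-horizon rollouts.
Derive unbiased policy gradient estimators (including baseline and advantage variants) and prove that they are bounded.
Use a supermartingale argument to establish asymptotic convergence of RPG to stationary points.
Propose Modified RPG (MRPG) with periodically enlarged stepsizes, which escapes saddle points and converges under mild assumptions on the reward and the policy parameterization.
Show how a baseline reduces gradient variance and improves convergence.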
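
To make the random-horizon idea concrete, here is a minimal Python sketch of one way to build an unbiased Q-value estimate from a finite rollout. It is an illustration in the spirit of the EstQ subroutine, not a transcription of it: the function names, the callback signatures, and the toy chain are all hypothetical. The sketch assumes the horizon T is geometric with success probability 1 - sqrt(γ) and that rewards are weighted by γ^(t/2), so that P(T ≥ t) · γ^(t/2) = γ^t and the expectation matches the infinite-horizon discounted sum.

```python
import random


def sample_geometric(p, rng):
    """Sample T in {0, 1, 2, ...} with P(T = t) = p * (1 - p)**t."""
    t = 0
    while rng.random() >= p:
        t += 1
    return t


def est_q(s, a, gamma, reward_fn, step_fn, policy_fn, rng):
    """One random-horizon estimate of Q(s, a) = E[sum_t gamma**t * r_t]."""
    root = gamma ** 0.5
    # With success probability 1 - sqrt(gamma), P(T >= t) = gamma**(t / 2).
    horizon = sample_geometric(1.0 - root, rng)
    total = 0.0
    for t in range(horizon + 1):
        # Weighting by gamma**(t / 2) makes the expected weight of r_t equal gamma**t.
        total += (root ** t) * reward_fn(s, a)
        s = step_fn(s, a, rng)
        a = policy_fn(s, rng)
    return total


# Hypothetical toy chain: the state flips between 0 and 1, reward 1 only in state 0.
def toy_reward(s, a):
    return 1.0 if s == 0 else 0.0


def toy_step(s, a, rng):
    return 1 - s


def toy_policy(s, rng):
    return 0


gamma = 0.9
rng = random.Random(0)
num_samples = 20000
estimate = sum(
    est_q(0, 0, gamma, toy_reward, toy_step, toy_policy, rng)
    for _ in range(num_samples)
) / num_samples
exact = 1.0 / (1.0 - gamma ** 2)  # 1 + gamma**2 + gamma**4 + ...
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

Averaging many such finite rollouts should approach the exact discounted value, even though no single rollout is infinite. Each estimate is also bounded whenever the reward is bounded, which is the kind of property the boundedness results rely on.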

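Experimental Results

Research Questions

RQ1: Do random-horizon policy gradient methods converge asymptotically to stationary points of the infinite-horizon objective J(θ)?
RQ2: Under what conditions can policy gradient methods escape saddle points and converge to (approximate) second-order stationary points?
RQ3: Do reward shaping and regularity conditions on the policy parameterization affect the ability to reach locally optimal policies in reinforcement learning?
RQ4: Does a periodically enlarged stepsize schedule improve the convergence properties of policy gradient methods in nonconvex reinforcement learning settings?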
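
For readers less familiar with the optimization vocabulary in these questions, the notions can be written out as follows. This uses the standard definitions from nonconvex optimization, stated for maximizing J; the tolerance symbols ε, ε_g, and ε_h are notation assumed here, and the paper's exact constants may differ.

```latex
% \theta is an \epsilon-approximate first-order stationary point of J if
\|\nabla J(\theta)\| \le \epsilon .

% \theta is an (\epsilon_g, \epsilon_h)-approximate second-order stationary point if
\|\nabla J(\theta)\| \le \epsilon_g
\quad\text{and}\quad
\lambda_{\max}\!\big(\nabla^2 J(\theta)\big) \le \epsilon_h .

% A stationary point is a strict saddle (for maximization) if an ascent direction exists:
\nabla J(\theta) = 0
\quad\text{and}\quad
\lambda_{\max}\!\big(\nabla^2 J(\theta)\big) > 0 .
```

Under a strict-saddle assumption, every stationary point that is not a local maximum has such an ascent direction, so reaching an approximate second-order stationary point means being close to a locally optimal policy.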

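Key Findings

RPG with a random rollout horizon produces unbiased gradient estimates and converges almost surely to stationary points of J(θ).
A finite-sample analysis gives convergence rates and, under standard assumptions, yields a corollary for RL with a constant learning rate.
Under mild reward and regularity assumptions, MRPG with periodically enlarged stepsizes escapes saddle points and converges to approximate second-order stationary points.
In practice, reward reshaping helps avoid unfavorable saddle points and improves the limit points attained, giving empirical support to the nonconvex optimization view.
Incorporating a baseline into the gradient estimate reduces variance while preserving convergence to stationary points.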
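
The periodically enlarged stepsize can be illustrated on a toy problem. The sketch below is not the paper's MRPG: it runs stochastic gradient ascent on a hypothetical two-dimensional objective J(x, y) = -x²/2 + y²/2 - y⁴/4, which has a strict saddle at the origin and local maxima at (0, ±1). The schedule (a larger stepsize `beta` once every `period` iterations, `alpha` otherwise), the noise level, and all parameter values are assumptions chosen for illustration.

```python
import random


def stepsize(k, alpha, beta, period):
    """Return the enlarged stepsize beta once every `period` iterations, else alpha."""
    return beta if k % period == 0 else alpha


def noisy_grad(x, y, rng, sigma=0.1):
    """Stochastic gradient of J(x, y) = -x**2/2 + y**2/2 - y**4/4."""
    gx = -x + rng.gauss(0.0, sigma)
    gy = y - y ** 3 + rng.gauss(0.0, sigma)
    return gx, gy


def ascend(num_iters=5000, alpha=0.01, beta=0.1, period=50, seed=0):
    """Run stochastic gradient ascent starting exactly at the saddle point."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for k in range(num_iters):
        gx, gy = noisy_grad(x, y, rng)
        eta = stepsize(k, alpha, beta, period)
        # Ascent step: move along the (noisy) gradient of J.
        x, y = x + eta * gx, y + eta * gy
    return x, y


x_final, y_final = ascend()
print(f"x={x_final:.3f}, y={y_final:.3f}")
```

Starting at the saddle, the exact gradient is zero, so progress depends on the gradient noise being amplified along the ascent direction y. The occasional large step amplifies that noise more strongly, which is the intuition behind using it to leave saddle points, while the small steps in between keep the iterates stable near a maximum.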

This review was generated by AI and reviewed by a human editor.