QUICK REVIEW

[Paper Review] Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel|arXiv (Cornell University)|Jul 4, 2020

Reinforcement Learning in Robotics | References: 49 | Cited by: 37

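One-Sentence Summary

The paper proposes a variational policy gradient framework for reinforcement learning with general concave utilities of the occupancy measure, derives a gradient estimator based on a stochastic saddle point problem, and proves global convergence with explicit rates, improving on the known rate for standard policy gradient in some special cases.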

ABSTRACT

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.

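Motivation and Objectives

Motivate policy optimization for RL problems whose objective is a general concave utility of the state-action occupancy measure, going beyond cumulative rewards.
Propose a Variational Policy Gradient Theorem that recasts the policy gradient as the solution of a stochastic saddle point problem.
Provide an estimator based on sample paths and give convergence guarantees for the proposed method.
Characterize the convergence rate: O(1/t) in general, and exponential convergence under a condition akin to strong convexity.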
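The objects named in these goals can be written out as a short math sketch. The symbols \gamma (discount factor), \xi (initial state distribution), and r (reward vector) are notation assumed here for illustration, since the review itself does not define them. The last identity is the standard concave-conjugate (Fenchel) representation and assumes F is concave and upper semicontinuous; it shows why a saddle point formulation is natural, and it is not the exact statement of the paper's theorem.

```latex
% Discounted state-action occupancy measure of a policy \pi
\lambda^{\pi}(s,a) = \sum_{t=0}^{\infty} \gamma^{t}\,
  \Pr\left(s_t = s,\ a_t = a \mid \pi,\ s_0 \sim \xi\right)

% General-utility objective over policy parameters \theta
\max_{\theta}\ F\left(\lambda^{\pi_\theta}\right), \qquad F \text{ concave}

% Cumulative reward is the linear special case
F(\lambda) = \langle r, \lambda \rangle

% Return obtained when z is treated as a reward function
V(\theta; z) = \langle z, \lambda^{\pi_\theta} \rangle

% Concave conjugate and the resulting variational form of the objective
F^{*}(z) = \inf_{\lambda}\ \left\{ \langle z, \lambda \rangle - F(\lambda) \right\},
\qquad
F\left(\lambda^{\pi_\theta}\right) = \inf_{z}\ \left\{ V(\theta; z) - F^{*}(z) \right\}
```

Because V(\theta; z) is an ordinary cumulative return with z in place of the reward, it can be estimated from sample paths even though F itself does not satisfy a Bellman equation.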

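Proposed Method

Derive the Variational Policy Gradient Theorem, which shows that the gradient is the solution of a stochastic saddle point problem involving the Fenchel dual of the utility.
Formulate the problem in terms of the occupancy measure: the utility F(lambda) is a concave function of lambda, where lambda is the state-action occupancy measure.
Develop a variational Monte Carlo gradient estimator that uses sample paths to estimate V(theta; z) and its gradient for an arbitrary function z.
Give a primal-dual stochastic approximation algorithm (Algorithm 1) that computes the gradient estimate from n episodes with O(1/√n) error.
Prove global convergence of gradient ascent in theta by exploiting hidden convexity in the lambda space, and establish convergence rates.
Discuss special cases including constrained MDPs, maximal exploration, and learning from demonstrations.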
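The role of V(theta; z) in the steps above can be checked numerically on a small tabular MDP. The sketch below is not the paper's Algorithm 1: it computes the occupancy measure exactly by a linear solve instead of estimating it from samples, and it only verifies the chain-rule identity that the gradient of F(lambda(theta)) equals the gradient of V(theta; z) when z is held fixed at the gradient of F at the current occupancy measure. The random MDP, the softmax parametrization, the entropy utility, and all function names are assumptions made for illustration.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (S, A)."""
    logits = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def occupancy(theta, P, xi, gamma):
    """Discounted state-action occupancy lambda(theta), shape (S, A)."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state transitions under pi
    d = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi.T, xi)
    return d[:, None] * pi

def utility(lam, gamma):
    """Concave utility (assumed example): entropy of (1 - gamma) * lam."""
    q = (1.0 - gamma) * lam
    return float(-np.sum(q * np.log(q)))

def utility_grad(lam, gamma):
    """Gradient of the utility with respect to lam."""
    q = (1.0 - gamma) * lam
    return -(1.0 - gamma) * (np.log(q) + 1.0)

def finite_diff_grad(f, theta, eps=1e-6):
    """Central finite differences, used only to check the identity."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        g[idx] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

# Hypothetical random MDP with 4 states and 3 actions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
xi = np.full(S, 1.0 / S)
theta = rng.normal(size=(S, A))

lam = occupancy(theta, P, xi, gamma)
z_star = utility_grad(lam, gamma)  # reward-like vector, held fixed below

# Gradient of the general utility F(lambda(theta)) ...
grad_direct = finite_diff_grad(
    lambda th: utility(occupancy(th, P, xi, gamma), gamma), theta)
# ... versus the gradient of V(theta; z_star) = <z_star, lambda(theta)>.
grad_shadow = finite_diff_grad(
    lambda th: float(np.sum(z_star * occupancy(th, P, xi, gamma))), theta)

print("max abs difference:", np.max(np.abs(grad_direct - grad_shadow)))
```

In a sample-based setting the occupancy measure, and hence z_star, is not directly available, which is one way to read the motivation for obtaining the gradient through a saddle point problem over z instead.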

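Results

Research Questions

RQ1: Can policy optimization proceed effectively under a general concave utility of the occupancy measure, where the Bellman equation does not hold?
RQ2: When the objective is a general concave function of the occupancy measure, how can the policy gradient be computed and estimated?
RQ3: Under general utilities, what are the convergence properties and rates of the variational policy gradient method, including special cases such as cumulative rewards or strongly concave utilities?

Key Findings

A Variational Policy Gradient Theorem shows that the gradient can be obtained from a stochastic saddle point problem involving the Fenchel dual of the utility.
The proposed variational gradient estimator converges in the number of episodes n, with error O(1/√n).
Despite nonconvexity, global convergence of variational policy gradient ascent is established, at a rate of O(1/t) obtained by exploiting hidden convexity.
In the special case of cumulative rewards, the analysis improves the known convergence rate, reaching the rate of softmax or natural policy gradient variants.
When the utility is strongly concave in the occupancy measure, gradient ascent converges exponentially fast.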
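The O(1/√n) behavior in the number of episodes is what one expects from averaging independent sample paths. The sketch below illustrates only the simplest ingredient, a Monte Carlo estimate of V(theta; z) for a fixed z and a fixed policy, compared against the exact value on a small tabular MDP. It is not the paper's primal-dual Algorithm 1, and the random MDP, the truncation horizon, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def sample_categorical(prob_rows, rng):
    """Draw one index per row of an (n, K) probability matrix (inverse CDF)."""
    cdf = np.cumsum(prob_rows, axis=1)
    u = rng.random((prob_rows.shape[0], 1))
    idx = (u > cdf).sum(axis=1)
    # Guard against round-off making the last CDF entry slightly below 1.
    return np.minimum(idx, prob_rows.shape[1] - 1)

def mc_value(pi, z, P, xi, gamma, n_episodes, horizon, rng):
    """Average truncated discounted return over n_episodes sample paths,
    treating z as the reward function."""
    s = sample_categorical(np.tile(xi, (n_episodes, 1)), rng)
    returns = np.zeros(n_episodes)
    for t in range(horizon):
        a = sample_categorical(pi[s], rng)    # one action per episode
        returns += gamma**t * z[s, a]
        s = sample_categorical(P[s, a], rng)  # one next state per episode
    return float(returns.mean())

def exact_value(pi, z, P, xi, gamma):
    """Exact V = <z, lambda> from the occupancy measure via a linear solve."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi.T, xi)
    return float(np.sum(d[:, None] * pi * z))

# Hypothetical random MDP, policy, and reward-like function z in [0, 1).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)
z = rng.random((S, A))
xi = np.full(S, 1.0 / S)

exact = exact_value(pi, z, P, xi, gamma)
estimate = mc_value(pi, z, P, xi, gamma, n_episodes=2000, horizon=80, rng=rng)
print("exact:", exact, "estimate:", estimate)
```

Truncating at a finite horizon introduces a bias of at most gamma**horizon / (1 - gamma) for z in [0, 1), which is small here compared with the sampling error at this episode count.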

This review was generated by AI and reviewed by a human editor.