QUICK REVIEW

[Paper Review] Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel|arXiv (Cornell University)|Jul 4, 2020

Reinforcement Learning in Robotics|References: 49|Citations: 37

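One-Line Summary

This paper introduces a Variational Policy Gradient framework for RL with general concave utilities of the occupancy measure, derives a stochastic saddle-point gradient estimator, and proves global convergence together with convergence rates, including a special case in which the rate improves on that of standard policy gradient.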

ABSTRACT

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.

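Motivation and Objectives

Motivate policy optimization for RL problems whose objective is a general concave utility of the state-action occupancy measure, going beyond cumulative rewards.
Develop a variational policy gradient theorem that recasts the gradient as the solution of a stochastic saddle-point problem.
Provide a sample-path-based estimator and prove convergence guarantees for the proposed method.
Characterize the convergence rate: O(1/t) in general, and exponential under conditions such as hidden strong convexity.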

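Proposed Method

Derives the Variational Policy Gradient Theorem, which shows that the policy gradient is the solution of a saddle-point problem involving the Fenchel dual of the utility.
Formulates the problem in terms of the occupancy measure λ and a concave functional F(λ).
Develops a variational Monte Carlo gradient estimator that uses sample paths to estimate V(θ; z) and its gradient for an arbitrary function z.
Provides a primal-dual stochastic approximation algorithm (Algorithm 1) that computes the gradient estimate with error O(1/√n) in the number of episodes n.
Proves global convergence of gradient ascent in θ by exploiting hidden convexity in λ-space, and establishes convergence rates.
Discusses special cases such as constrained MDPs, maximal exploration, and learning from demonstrations.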
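
The formulation summarized above can be written out as a short sketch. The notation here (ξ for the initial state distribution, R(θ) for the composite objective, x for the auxiliary primal variable, δ for the scaling parameter) is chosen for illustration and may differ from the paper's; the precise statement and its regularity conditions should be taken from the paper itself.

```latex
% Discounted state-action occupancy measure of the policy \pi_\theta
\lambda_{sa}(\theta) = \sum_{t=0}^{\infty} \gamma^{t}\,
  \mathbb{P}\bigl(s_t = s,\ a_t = a \,\big|\, \pi_\theta,\ s_0 \sim \xi\bigr)

% General-utility objective; F(\lambda) = \langle r, \lambda \rangle recovers cumulative reward
\max_{\theta}\ R(\theta) := F\bigl(\lambda(\theta)\bigr)

% Concave (Fenchel) conjugate of F and the duality used to linearize F
F^{*}(z) = \inf_{\lambda}\,\bigl\{ \langle z, \lambda \rangle - F(\lambda) \bigr\},
\qquad
F(\lambda) = \inf_{z}\,\bigl\{ \langle z, \lambda \rangle - F^{*}(z) \bigr\}

% Cumulative return of \pi_\theta when z is used as the reward function
V(\theta; z) = \langle z, \lambda(\theta) \rangle

% Saddle-point characterization of the gradient
\nabla_\theta R(\theta) = \lim_{\delta \to 0^{+}} \operatorname*{arg\,max}_{x}\ \inf_{z}
  \Bigl\{ V(\theta; z) + \delta\, \nabla_\theta V(\theta; z)^{\top} x
          - F^{*}(z) - \tfrac{\delta}{2}\, \lVert x \rVert^{2} \Bigr\}
```

One way to read the last line: the inner infimum over z equals F evaluated at the perturbed occupancy λ(θ) + δ ∇_θλ(θ) x, so for small δ the outer problem maximizes a linearization of F minus a quadratic penalty, and its maximizer x tends to ∇_θ F(λ(θ)). Both V(θ; z) and ∇_θ V(θ; z) are ordinary cumulative-reward quantities, which is why they can be estimated from sample paths.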

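Experimental Results

Research Questions

RQ1: Can policy optimization be carried out effectively for general concave utilities of the occupancy measure, for which the Bellman equation does not hold?
RQ2: When the objective is a general concave function of the occupancy measure, how can the policy gradient be computed and estimated?
RQ3: What are the convergence properties and convergence rates of the variational policy gradient method under general utilities, including the special cases of cumulative rewards and strongly concave utilities?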

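Key Findings

By the Variational Policy Gradient Theorem, the gradient is obtained as the solution of a saddle-point problem involving the Fenchel dual of the utility.
The proposed variational gradient estimator converges with error O(1/√n) in the number of episodes n.
Despite nonconvexity, variational policy gradient ascent converges globally, with rate O(1/t) under hidden convexity.
In the cumulative-reward special case, the method improves on previously known convergence rates and matches the rates of softmax or natural policy gradient variants.
When the utility is strongly concave in the occupancy measure, the ascent converges exponentially fast.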
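
The first finding rests on a chain-rule identity: the gradient of F(λ(θ)) equals the ordinary policy gradient of V(θ; z) when the pseudo-reward z is set to ∇F(λ(θ)). The sketch below checks this numerically; it is not the paper's sample-based Algorithm 1. Everything in it is an illustrative assumption: the two-state, two-action MDP, the tabular softmax policy, the entropy utility (in the spirit of the maximal-exploration example), and all function names.

```python
import math

GAMMA = 0.9
N_S, N_A = 2, 2
# Hypothetical toy MDP: P[s][a][s2] = probability of moving s -> s2 under action a.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
XI = [1.0, 0.0]  # assumed initial state distribution


def softmax_policy(theta):
    """Tabular softmax policy pi[s][a] from logits theta[s][a]."""
    pi = []
    for row in theta:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        pi.append([v / total for v in e])
    return pi


def occupancy(theta, horizon=500):
    """Unnormalized discounted occupancy lam[s][a] = sum_t GAMMA^t P(s_t=s, a_t=a)."""
    pi = softmax_policy(theta)
    d = list(XI)  # state distribution at the current time step
    lam = [[0.0] * N_A for _ in range(N_S)]
    disc = 1.0
    for _ in range(horizon):
        for s in range(N_S):
            for a in range(N_A):
                lam[s][a] += disc * d[s] * pi[s][a]
        d = [sum(d[s] * pi[s][a] * P[s][a][s2]
                 for s in range(N_S) for a in range(N_A)) for s2 in range(N_S)]
        disc *= GAMMA
    return lam


def utility(lam):
    """Concave utility F: entropy of the normalized occupancy (1 - GAMMA) * lam."""
    c = 1 - GAMMA
    return -sum(c * x * math.log(c * x) for row in lam for x in row)


def utility_grad(lam):
    """Partial derivatives dF/dlam[s][a], used as the pseudo-reward z."""
    c = 1 - GAMMA
    return [[-c * (math.log(c * x) + 1.0) for x in row] for row in lam]


def policy_gradient(theta, z, iters=500):
    """Exact gradient of V(theta; z) = <z, lam(theta)> via the policy gradient theorem."""
    pi = softmax_policy(theta)
    lam = occupancy(theta)
    v = [0.0] * N_S
    for _ in range(iters):  # iterative policy evaluation with reward z
        q = [[z[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        v = [sum(pi[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    # For a tabular softmax policy: dV/dtheta[s][a] = d(s) * pi(a|s) * (Q(s,a) - V(s)).
    return [[sum(lam[s]) * pi[s][a] * (q[s][a] - v[s]) for a in range(N_A)]
            for s in range(N_S)]


def finite_difference_grad(theta, eps=1e-5):
    """Central finite differences of F(lam(theta)), used only as a reference."""
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for a in range(N_A):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][a] += eps
            minus[s][a] -= eps
            grad[s][a] = (utility(occupancy(plus)) - utility(occupancy(minus))) / (2 * eps)
    return grad


theta = [[0.3, -0.2], [0.1, 0.5]]
z = utility_grad(occupancy(theta))    # pseudo-reward z = grad F(lam(theta))
g_chain = policy_gradient(theta, z)   # policy gradient of V(theta; z) at that z
g_fd = finite_difference_grad(theta)  # direct numerical gradient of F(lam(theta))
max_gap = max(abs(g_chain[s][a] - g_fd[s][a]) for s in range(N_S) for a in range(N_A))
print("largest gap between the two gradients:", max_gap)
```

In this toy setting the two gradients should agree up to finite-difference error. The check uses exact occupancies from a known model; the point of the variational estimator is to avoid that requirement by solving the saddle-point problem from sampled trajectories.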

This review was generated by AI and checked by a human editor.