QUICK REVIEW

[Paper Review] Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang, Alec Koppel | arXiv (Cornell University) | 2020-07-04

Reinforcement Learning in Robotics · References: 49 · Citations: 37

One-Line Summary

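This paper introduces a Variational Policy Gradient framework for RL with general concave utilities of the occupancy measure, derives a stochastic saddle-point gradient estimator, and guarantees global convergence along with explicit convergence rates. In the special case of cumulative rewards, the resulting rate improves on the one previously available for standard policy gradient.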

ABSTRACT

In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.

Research Motivation and Objectives

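Motivate policy optimization in RL for general concave utilities of the state-action occupancy measure, going beyond cumulative rewards.
Develop a Variational Policy Gradient Theorem that recasts the policy gradient as the solution of a stochastic saddle-point problem.
Provide a sample-path-based estimator and convergence guarantees for the proposed method.
Characterize the convergence properties, including an O(1/t) rate in general and exponential convergence under hidden strong convexity.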

Proposed Method

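Derive the Variational Policy Gradient Theorem, which shows that the policy gradient is the solution of a saddle-point problem involving the Fenchel dual of the utility F.
Formulate the problem as maximizing a concave function F(lambda) of the state-action occupancy measure lambda.
Develop a Variational Monte Carlo gradient estimator that uses sample paths to estimate V(theta; z) and its gradient for an arbitrary function z.
Provide a primal-dual stochastic approximation algorithm (Algorithm 1) that computes the gradient estimate with error O(1/√n) from n episodes.
Establish global convergence of gradient ascent in theta, with rates, by exploiting hidden convexity in the lambda space.
Discuss special cases such as constrained MDPs, maximal exploration, and learning from demonstrations.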
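The role of the Fenchel dual in the steps above can be made concrete with a short chain-rule sketch. This is an informal outline, not the paper's exact theorem statement: the discount factor \gamma, the initial distribution \xi, the concave-conjugate sign convention, and the assumption that F is differentiable at the current occupancy measure are all introduced here for illustration.

```latex
% Occupancy measure of policy \pi_\theta (assumed discounted form) and the objective
\lambda^{\pi_\theta}_{sa} = \sum_{t \ge 0} \gamma^{t}\,
    \Pr\bigl(s_t = s,\ a_t = a \mid \pi_\theta,\ s_0 \sim \xi\bigr),
\qquad
\max_{\theta}\ R(\pi_\theta) := F\bigl(\lambda^{\pi_\theta}\bigr).

% Value of \pi_\theta when z is used as the reward function: linear in \lambda
V(\theta; z) = \bigl\langle z,\ \lambda^{\pi_\theta} \bigr\rangle .

% Concave conjugate and biconjugate representation (F concave, upper semicontinuous)
F^{*}(z) = \inf_{\lambda}\ \bigl\{ \langle z, \lambda \rangle - F(\lambda) \bigr\},
\qquad
F(\lambda) = \inf_{z}\ \bigl\{ \langle z, \lambda \rangle - F^{*}(z) \bigr\},
\quad \text{attained at } z^{\star} = \nabla F(\lambda).

% Chain rule: the gradient is a standard policy gradient with pseudo-reward z^\star
\nabla_\theta R(\pi_\theta)
  = \sum_{s,a} \frac{\partial F}{\partial \lambda_{sa}}\bigl(\lambda^{\pi_\theta}\bigr)\,
    \nabla_\theta \lambda^{\pi_\theta}_{sa}
  = \nabla_\theta V(\theta; z)\Big|_{z = z^{\star}},
\qquad
z^{\star} \in \arg\min_{z}\ \bigl\{ V(\theta; z) - F^{*}(z) \bigr\}.
```

For a linear utility F(lambda) = <r, lambda>, the pseudo-reward is z* = r and the expression reduces to the usual policy gradient for cumulative rewards. For a general F, z* depends on the unknown occupancy measure, which is why it has to be found jointly with the gradient from sample paths rather than read off directly.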

Experimental Results

Research Questions

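RQ1: Can policy optimization be carried out effectively for general concave utilities of the occupancy measure, even when the Bellman equation no longer holds?
RQ2: When the objective is a general concave function of the occupancy measure, how can the policy gradient be computed and estimated?
RQ3: What are the convergence properties and rates of the Variational Policy Gradient method under general utilities, including special cases such as cumulative rewards and strongly concave utilities?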

Key Results

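The Variational Policy Gradient Theorem shows that the policy gradient can be obtained as the solution of a stochastic saddle-point problem involving the Fenchel dual of the utility.
The proposed variational gradient estimator converges with O(1/√n) error in the number of episodes n.
Global convergence of variational policy gradient ascent in theta is established by exploiting hidden convexity, with an O(1/t) rate.
In the special case of cumulative rewards, the method improves on the previously known convergence rate, achieving a rate that matches those of softmax or natural policy gradient variants.
When the utility is strongly concave in the occupancy measure (hidden strong convexity), gradient ascent converges exponentially fast.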
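As a rough numerical illustration of the first and third results, the sketch below builds a small random tabular MDP, computes the gradient of a concave utility of the occupancy measure as an ordinary policy gradient with pseudo-reward z = grad F(lambda), and compares it with finite differences before running plain gradient ascent. Everything specific here is an assumption made for illustration and is not taken from the paper: the entropy utility, the softmax parametrization, gamma = 0.9, the step size, and all function names. It also uses exact linear solves with known dynamics, so it is not the sample-based Algorithm 1.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] is proportional to exp(theta[s, a])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def occupancy(theta, P, xi, gamma):
    """Unnormalized discounted occupancy lam[s, a] = sum_t gamma^t Pr(s_t=s, a_t=a)."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    d = np.linalg.solve(np.eye(len(xi)) - gamma * P_pi.T, xi)  # state occupancy
    return d[:, None] * pi, pi, P_pi

def utility(lam, gamma):
    """Illustrative concave utility: entropy of the normalized occupancy."""
    p = (1.0 - gamma) * lam
    return -np.sum(p * np.log(p))

def utility_grad(lam, gamma):
    """dF/dlam, which plays the role of the pseudo-reward z."""
    p = (1.0 - gamma) * lam
    return -(1.0 - gamma) * (np.log(p) + 1.0)

def general_utility_gradient(theta, P, xi, gamma):
    """Chain rule: grad_theta F(lam(theta)) = grad_theta V(theta; z) at z = grad F(lam)."""
    lam, pi, P_pi = occupancy(theta, P, xi, gamma)
    z = utility_grad(lam, gamma)
    r_pi = np.sum(pi * z, axis=1)                       # expected pseudo-reward per state
    v = np.linalg.solve(np.eye(len(xi)) - gamma * P_pi, r_pi)
    q = z + gamma * (P @ v)                             # action values under reward z
    return lam * (q - v[:, None])                       # softmax policy gradient

def finite_difference_gradient(theta, P, xi, gamma, eps=1e-6):
    """Central differences on theta, used only as a correctness check."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        hi = utility(occupancy(theta + step, P, xi, gamma)[0], gamma)
        lo = utility(occupancy(theta - step, P, xi, gamma)[0], gamma)
        grad[idx] = (hi - lo) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, :] is a next-state distribution
xi = np.full(S, 1.0 / S)                    # uniform initial state distribution
theta = rng.normal(size=(S, A))

grad_exact = general_utility_gradient(theta, P, xi, gamma)
grad_fd = finite_difference_gradient(theta, P, xi, gamma)
max_gap = np.abs(grad_exact - grad_fd).max()

# Plain gradient ascent on theta with a fixed, hand-picked step size.
initial_utility = utility(occupancy(theta, P, xi, gamma)[0], gamma)
theta_t = theta.copy()
for _ in range(500):
    theta_t = theta_t + 0.2 * general_utility_gradient(theta_t, P, xi, gamma)
final_utility = utility(occupancy(theta_t, P, xi, gamma)[0], gamma)

print(f"max |exact - finite difference| = {max_gap:.2e}")
print(f"utility: {initial_utility:.4f} -> {final_utility:.4f}")
```

The entropy utility corresponds loosely to the maximal-exploration special case. Because the model is known here, grad F(lambda) can be evaluated directly; in the sample-based setting that quantity is not available in closed form, which is the gap the saddle-point estimator is meant to fill.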


This review was generated by AI and reviewed by a human editor.