QUICK REVIEW

[Paper Review] Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies

Kaiqing Zhang, Alec Koppel|arXiv (Cornell University)|2019. 06. 19.

Reinforcement Learning in Robotics|References: 61|Citations: 44

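One-Line Summary

This paper shows that a random-horizon policy gradient (RPG) method estimates the infinite-horizon policy gradient without bias and converges to stationary points. It then introduces a modified RPG (MRPG) with periodically enlarged stepsizes that escapes saddle points and approaches approximately locally optimal policies, and corroborates the theory with inverted pendulum experiments.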

ABSTRACT

Policy gradient (PG) methods are a widely used reinforcement learning methodology in many applications such as video games, autonomous driving, and robotics. In spite of its empirical success, a rigorous understanding of the global convergence of PG methods is lacking in the literature. In this work, we close the gap by viewing PG methods from a nonconvex optimization perspective. In particular, we propose a new variant of PG methods for infinite-horizon problems that uses a random rollout horizon for the Monte-Carlo estimation of the policy gradient. This method then yields an unbiased estimate of the policy gradient with bounded variance, which enables the tools from nonconvex optimization to be applied to establish global convergence. Employing this perspective, we first recover the convergence results with rates to the stationary-point policies in the literature. More interestingly, motivated by advances in nonconvex optimization, we modify the proposed PG method by introducing periodically enlarged stepsizes. The modified algorithm is shown to escape saddle points under mild assumptions on the reward and the policy parameterization. Under a further strict saddle points assumption, this result establishes convergence to essentially locally-optimal policies of the underlying problem, and thus bridges the gap in existing literature on the convergence of PG methods. Results from experiments on the inverted pendulum are then provided to corroborate our theory, namely, by slightly reshaping the reward function to satisfy our assumption, unfavorable saddle points can be avoided and better limit points can be attained. Intriguingly, this empirical finding justifies the benefit of reward-reshaping from a nonconvex optimization perspective.

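Research Motivation and Goals

Develop a rigorous understanding of the global convergence of policy gradient methods in infinite-horizon MDPs.
Introduce rollouts with a random, geometrically distributed horizon to obtain unbiased gradient estimates.
Connect policy gradient convergence to tools from nonconvex optimization and establish convergence rates to stationary points.
Propose a modified RPG (MRPG) with periodically enlarged stepsizes that escapes saddle points and converges to approximately locally optimal policies.
Demonstrate the benefit of reward reshaping from a nonconvex optimization perspective and validate it experimentally.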

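Proposed Method

Define RPG, which uses rollout horizons drawn from a geometric distribution to estimate Q and the policy gradient without bias.
Provide the EstQ and EstV subroutines, which produce unbiased Q and value estimates from finite-horizon rollouts.
Derive unbiased policy gradient estimates, including baseline and advantage variants, and prove that they are bounded.
Use a supermartingale argument to guarantee asymptotic convergence of RPG to stationary points.
Propose a modified RPG (MRPG) with periodically enlarged stepsizes that escapes saddle points under mild assumptions on the reward and the policy parameterization.
Show how a baseline reduces gradient variance and improves convergence.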
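The idea behind an EstQ-style subroutine can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the two-state MDP, the uniform policy, and the names `sample_horizon`, `step`, `reward`, `policy`, and `est_q` are all hypothetical. The horizon T is drawn so that P(T >= t) = γ^(t/2), and each reward is weighted by γ^(t/2); the product of the two gives the usual discount γ^t in expectation, so a finite rollout yields an unbiased estimate of the infinite-horizon Q value.

```python
import math
import random

GAMMA = 0.9

def sample_horizon(q, rng):
    """Draw T in {0, 1, ...} with P(T >= t) = q**t (geometric, by inversion)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(q))

# Hypothetical two-state toy MDP, used only for illustration.
def step(s, a, rng):
    """Next state equals the action with probability 0.9, else the other state."""
    return a if rng.random() < 0.9 else 1 - a

def reward(s, a):
    """Reward 1 in state 1, reward 0 in state 0."""
    return 1.0 if s == 1 else 0.0

def policy(s, rng):
    """Uniformly random policy over the two actions."""
    return rng.randrange(2)

def est_q(s, a, gamma, rng):
    """Random-horizon estimate of Q(s, a) from one finite rollout."""
    q = math.sqrt(gamma)
    horizon = sample_horizon(q, rng)
    total = 0.0
    for t in range(horizon + 1):
        total += (q ** t) * reward(s, a)  # weight q**t, survival prob. q**t
        s = step(s, a, rng)
        a = policy(s, rng)
    return total

rng = random.Random(0)
n = 40_000
estimate = sum(est_q(0, 1, GAMMA, rng) for _ in range(n)) / n
# Closed form for this toy MDP: E[r_0] = 0, E[r_1] = 0.9, E[r_t] = 0.5 for t >= 2.
exact = 0.9 * GAMMA + 0.5 * GAMMA ** 2 / (1.0 - GAMMA)
print(f"Monte Carlo mean: {estimate:.3f}, closed form: {exact:.3f}")
```

Averaging many such rollouts should land close to the closed-form value, even though every individual rollout stops after finitely many steps.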

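Experimental Results

Research Questions

RQ1: Can the random-horizon policy gradient method converge asymptotically to stationary points of the infinite-horizon objective J(θ)?
RQ2: Under what conditions can policy gradient methods escape saddle points and converge to (approximate) second-order stationary points?
RQ3: Do reward reshaping and the choice of a general policy parameterization affect the ability to reach locally optimal policies in RL?
RQ4: Does a periodically enlarged stepsize strategy improve the convergence properties of policy gradient methods in nonconvex RL settings?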
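For reference, the notions of stationarity behind RQ1 and RQ2 can be written out as below. This is the standard formulation for a maximization problem; the symbols r_t, \epsilon_g, \epsilon_h, and \lambda_{\max} are notation assumed here, and the paper may state its tolerances differently.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, \pi_\theta\Big],
  \qquad 0 < \gamma < 1, \\
\text{first-order stationary point:}\quad & \nabla J(\theta) = 0, \\
\text{approximate second-order stationary point:}\quad &
  \|\nabla J(\theta)\| \le \epsilon_g
  \ \text{ and } \
  \lambda_{\max}\big(\nabla^{2} J(\theta)\big) \le \epsilon_h .
\end{align*}
```

A strict saddle point satisfies the first-order condition but has \lambda_{\max}(\nabla^{2} J(\theta)) > 0, so an ascent direction still exists there; a second-order condition is what rules such points out.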

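Key Results

RPG with a random horizon produces unbiased gradient estimates and converges almost surely to stationary points of J(θ).
The finite-sample analysis gives convergence rates and, under assumptions standard in RL, establishes a corollary for constant stepsizes.
MRPG with periodically enlarged stepsizes can escape saddle points and converge to approximate second-order stationary points under mild reward and regularity assumptions.
In practice, reward reshaping helps avoid unfavorable saddle points and attain better limit points, which gives empirical support to the nonconvex optimization perspective.
Incorporating a baseline into the gradient estimate reduces variance while preserving convergence to stationary points.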
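The saddle-point behavior can be illustrated on a toy problem. The sketch below is not the paper's MRPG and involves no RL: the objective J(x, y) = -x²/2 + y²/2 - y⁴/4, the Gaussian gradient noise, the schedule "use the larger stepsize `beta` every `period`-th iteration", and all parameter values are assumptions chosen for illustration. The objective has a strict saddle at (0, 0) and local maxima at (0, ±1).

```python
import random

def grad_J(x, y):
    """Gradient of the toy objective J(x, y) = -x**2/2 + y**2/2 - y**4/4."""
    return -x, y - y ** 3

def ascend(x, y, steps, alpha, beta=None, period=10, sigma=0.0, seed=0):
    """Gradient ascent with optional noise and a periodically enlarged stepsize."""
    rng = random.Random(seed)
    for k in range(steps):
        gx, gy = grad_J(x, y)
        # Additive noise stands in for a stochastic gradient estimate (assumption).
        gx += sigma * rng.gauss(0.0, 1.0)
        gy += sigma * rng.gauss(0.0, 1.0)
        # Every `period`-th iteration takes the larger step beta, otherwise alpha.
        eta = beta if (beta is not None and k % period == 0) else alpha
        x += eta * gx
        y += eta * gy
    return x, y

# Exact gradients, started on the line y = 0: the iterate slides into the saddle.
plain = ascend(1.0, 0.0, steps=5000, alpha=0.01)
# Noisy gradients with a periodic larger step: the iterate leaves the saddle.
noisy = ascend(1.0, 0.0, steps=5000, alpha=0.01, beta=0.1, sigma=0.1)
print("plain:", plain)
print("noisy:", noisy)
```

With exact gradients the y-coordinate never moves, so the run ends at the saddle. The noisy run ends near one of the local maxima (0, ±1). This toy does not separate the effect of the enlarged stepsize from that of the noise itself; it only shows the kind of stationary point each run reaches.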

This review was generated by AI and reviewed by a human editor.