QUICK REVIEW

[Paper Review] A Finite Time Analysis of Two Time-Scale Actor Critic Methods

Yue Wu, Weitong Zhang|arXiv (Cornell University)|2020. 05. 04.

Reinforcement Learning in Robotics|References: 38|Citations: 37

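One-Line Summary

The paper gives the first non-asymptotic analysis of two time-scale actor-critic methods with Markovian samples, proves convergence to an approximate stationary point, and establishes a sample complexity of $\tilde{O}(\epsilon^{-2.5})$ for finding an $\epsilon$-stationary point.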

ABSTRACT

Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.

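Motivation and Objectives

Motivate the study of finite-time convergence of two time-scale actor-critic (AC) algorithms under non-i.i.d. data.
Provide non-asymptotic convergence guarantees for an online, single-step AC method with a linear TD(0) critic.
Characterize the interaction between the actor and critic updates under Markovian noise.
Derive the convergence rate and the sample complexity for reaching a first-order stationary point.
Highlight how the proposed analysis improves understanding relative to decoupled or i.i.d.-assumption settings.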

Proposed Method

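Analyzes the classical two time-scale actor-critic algorithm with a TD(0) critic and linear function approximation.
Assumes the feature vectors have bounded norm and characterizes the TD(0) limiting point $\omega^*(\theta)$ of the critic through an associated matrix and vector.
Shows convergence of the actor on non-i.i.d. Markovian samples when the step-size exponents (actor $\sigma$, critic $\nu$) satisfy $0 < \sigma < 1$ and $0 < \nu < \sigma < 1$.
Under Assumptions 4.1-4.3 and by Proposition 4.4, shows that the critic's limiting solution is Lipschitz continuous in the policy parameters.
Derives the overall convergence rate as the sum of an approximation error $\epsilon_{\text{app}}$ and optimization error terms: $O(\epsilon_{\text{app}}) + O(t^{-(1-\sigma)}) + O(\log^2 t / t^{\sigma}) + O(\mathcal{E}(t))$, where $\mathcal{E}(t)$ denotes the critic's estimation error.
For the chosen $\sigma$ and $\nu$, reaching an $\epsilon$-stationary point (up to the $O(\epsilon_{\text{app}})$ approximation error) takes $\tilde{O}(\epsilon^{-2.5})$ samples in total.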
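
The two time-scale updates described in this list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the random MDP, the tabular softmax policy, the step-size constant 0.5, the projection radius, and every function and variable name are hypothetical choices made for the example. The sketch also assumes an average-reward TD error with a running reward estimate `eta`, which is one common formulation.

```python
import numpy as np

def two_timescale_ac(num_steps=5000, sigma=0.6, nu=0.4, seed=0):
    """Single-sample actor-critic with a linear TD(0) critic (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, d = 4, 2, 3

    # Hypothetical random MDP: P[s, a] is a distribution over next states.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    # State features for the linear critic, normalized so their norm is bounded.
    phi = rng.normal(size=(n_states, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)

    theta = np.zeros((n_states, n_actions))  # actor: softmax logits
    omega = np.zeros(d)                      # critic: linear weights
    eta = 0.0                                # running average-reward estimate
    radius = 10.0                            # assumed projection radius for omega
    s = 0

    for t in range(num_steps):
        # Two time-scales: with sigma > nu the actor step decays faster.
        alpha = 0.5 / (1 + t) ** sigma
        beta = 0.5 / (1 + t) ** nu

        # Sample one action and one transition along a single Markov trajectory.
        logits = theta[s] - theta[s].max()
        pi = np.exp(logits)
        pi /= pi.sum()
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])
        r = R[s, a]

        # Average-reward TD(0) error under the current critic.
        delta = r - eta + phi[s_next] @ omega - phi[s] @ omega

        # Critic updates (fast time-scale), followed by projection onto a ball.
        eta += beta * (r - eta)
        omega = omega + beta * delta * phi[s]
        norm = np.linalg.norm(omega)
        if norm > radius:
            omega *= radius / norm

        # Actor update (slow time-scale): delta times the softmax score function.
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += alpha * delta * grad_log

        s = s_next

    return theta, omega, eta
```

Calling `two_timescale_ac(num_steps=2000)` returns the policy logits, the critic weights, and the reward estimate. The defaults `sigma=0.6` and `nu=0.4` mirror the $3/5$ and $2/5$ exponents reported in the review; each iteration consumes exactly one sample and no i.i.d. resampling takes place.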

Experimental Results

Research Questions

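RQ1: Can two time-scale actor-critic methods achieve non-asymptotic convergence with linear function approximation under non-i.i.d. (Markovian) samples?
RQ2: What is the finite sample complexity of reaching an $\epsilon$-stationary point of the non-concave function $J(\boldsymbol{\theta})$?
RQ3: How do the actor and critic step sizes affect the convergence rate and the overall sample complexity?
RQ4: How does the analysis compare with decoupled actor-critic methods and with results that assume i.i.d. data?
RQ5: Does the framework extend to alternative policy evaluation schemes and nonlinear approximators?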

Key Results

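The actor-critic method converges to an approximate stationary point of $J$: the best iterate satisfies $\min_{0 \le k \le t} \mathbb{E}\|\nabla J(\boldsymbol{\theta}_k)\|_2^2 = O(\epsilon_{\text{app}}) + O(t^{-(1-\sigma)}) + O(\log^2 t / t^{\sigma}) + O(\mathcal{E}(t))$, where $\epsilon_{\text{app}}$ is the approximation error and $\mathcal{E}(t)$ is the critic's estimation error.
With step sizes $\alpha_t = O(1/t^{3/5})$ for the actor and $\beta_t = O(1/t^{2/5})$ for the critic, the method attains $\epsilon$-stationarity in $T = \tilde{O}(\epsilon^{-2.5})$ iterations, using one sample per iteration.
The overall finite-time sample complexity is therefore $\tilde{O}(\epsilon^{-2.5})$, with the guarantee holding up to the $O(\epsilon_{\text{app}})$ approximation error.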
The analysis handles Markovian noise and removes the need for i.i.d. data assumptions, unlike some prior works.
The authors propose a new proof framework that tightly bounds critic-estimation error and avoids extra artificial factors present in some iterative refinement approaches.
Compared to decoupled actor-critic methods, the two time-scale approach is more sample-efficient, achieving $\tilde{O}(\epsilon^{-2.5})$ vs. $\tilde{O}(\epsilon^{-4})$ in some decoupled analyses.
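
As a rough consistency check on these rates, the $\epsilon^{-2.5}$ figure can be read off from the actor step-size exponent. The sketch below assumes that, for $\sigma = 3/5$ and $\nu = 2/5$, the $t^{-(1-\sigma)}$ term governs the bound up to logarithmic factors. That assumption is consistent with the stated result but is not a reproduction of the paper's proof. Constants are ignored, and the last line is only an order-of-magnitude comparison at an illustrative accuracy $\epsilon = 10^{-2}$.

```latex
\begin{align*}
t^{-(1-\sigma)}\Big|_{\sigma = 3/5} &= t^{-2/5}, \\
t^{-2/5} \le \epsilon \quad &\Longleftrightarrow \quad t \ge \epsilon^{-5/2}, \\
\text{samples} = (\text{1 per iteration}) \times t &= \tilde{O}(\epsilon^{-2.5}), \\
\epsilon = 10^{-2}: \qquad \epsilon^{-2.5} = 10^{5} \quad &\text{vs.} \quad \epsilon^{-4} = 10^{8}.
\end{align*}
```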

This review was generated by AI and reviewed by a human editor.