QUICK REVIEW

[Paper Review] Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

Tengyu Xu, Zhe Wang|arXiv (Cornell University)|May 7, 2020

Reinforcement Learning in Robotics | References: 48 | Citations: 34

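One-sentence summary

This paper provides the first non-asymptotic, finite-sample convergence rates for two time-scale AC and NAC under Markovian sampling. AC attains an ε-accurate stationary point, and NAC attains an ε-accurate global optimal point.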

ABSTRACT

As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of policy is followed by an entire loop of critic's updates of the value function, and the finite-sample analysis of such AC and NAC algorithms have been recently well established. The second two time-scale design, in which actor and critic update simultaneously but with different learning rates, has much fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with actor having general policy class approximation. We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-2.5}\log^3(ε^{-1}))$ to attain an $ε$-accurate stationary point, and two time-scale NAC requires the overall sample complexity at the order of $\mathcal{O}(ε^{-4}\log^2(ε^{-1}))$ to attain an $ε$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.

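Motivation and objectives

Motivate the study of two time-scale AC/NAC as a practical alternative to nested-loop designs that is easier to tune.
Characterize the non-asymptotic convergence rate (sample complexity) under Markovian sampling.
Address the bias caused by dynamically changing Markovian sampling and the changing base functions in the critic and the nonlinear policy updates.
Establish that AC converges to an ε-accurate stationary point and NAC to an ε-accurate global optimal point.
Provide techniques for analyzing the bias caused by dynamically changing sampling and base functions.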

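Proposed method

Model AC/NAC as a two time-scale nonlinear stochastic approximation (SA) with a fast critic and a slow actor.
Use linear SA for the critic, whose base functions and transition kernel change dynamically.
Use nonlinear SA for the actor, with a general policy parameterization and diminishing step sizes.
Derive bias and drift bounds for dynamic Markovian sampling and changing base functions.
Prove a bound on the critic's tracking error and a Lipschitz property of the gradient to obtain the overall convergence rate.
Obtain explicit sample complexities: O(ε^{-2.5} log^3(ε^{-1})) for AC and O(ε^{-4} log^2(ε^{-1})) for NAC.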
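The "fast critic, slow actor" structure described above can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps in a tabular TD(0) critic with fixed one-hot features and a softmax actor on a made-up 2-state, 2-action MDP. The transition table `P`, the rewards `R`, the step-size constants, and the exponents `sigma=0.6` and `nu=0.4` are all assumptions chosen for illustration. The only point is that both updates happen on every sample of one Markovian trajectory, with the critic's step size decaying more slowly than the actor's.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_timescale_ac(T=5000, sigma=0.6, nu=0.4, gamma=0.9, seed=0):
    """Toy two time-scale actor-critic on a hypothetical 2-state, 2-action MDP."""
    rng = random.Random(seed)
    # Hypothetical transition kernel: P[s][a] is a distribution over next states.
    P = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.7, 0.3], [0.1, 0.9]]]
    # Hypothetical rewards R[s][a].
    R = [[0.0, 1.0], [0.5, 2.0]]
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: softmax logits per state
    w = [0.0, 0.0]                    # critic: value weights on one-hot features
    s = 0
    for t in range(T):
        alpha = 0.5 / (1 + t) ** sigma  # actor step size (decays faster)
        beta = 0.5 / (1 + t) ** nu      # critic step size (decays slower, nu < sigma)
        pi = softmax(theta[s])
        a = rng.choices([0, 1], weights=pi)[0]
        s_next = rng.choices([0, 1], weights=P[s][a])[0]
        # TD error computed from the current critic; both updates share it.
        delta = R[s][a] + gamma * w[s_next] - w[s]
        # Critic: single-sample TD(0) update.
        w[s] += beta * delta
        # Actor: policy-gradient step, using grad log pi(a|s) = e_a - pi.
        for b in range(2):
            grad_log = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += alpha * delta * grad_log
        s = s_next  # continue along the same trajectory (Markovian sampling)
    policy = [softmax(row) for row in theta]
    return policy, w
```

Calling `two_timescale_ac()` returns the final action probabilities for each state and the critic weights. There is no inner critic loop to size, which is the tuning advantage over a nested-loop design that the abstract points to.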

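Results

Research questions

RQ1: What finite-sample convergence rates can be established for two time-scale AC and NAC under Markovian sampling?
RQ2: How do dynamic Markovian sampling and changing base functions affect bias and convergence?
RQ3: Under single-sample updates, can two time-scale AC/NAC improve on the sample complexity of existing nested-loop designs?
RQ4: What precise sample complexity is needed to reach an ε-accurate stationary point (AC) or an ε-accurate global optimal point (NAC)?
RQ5: How does a nonlinear policy parameterization affect the convergence analysis of two time-scale nonlinear SA?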

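Key findings

Two time-scale AC attains an ε-accurate stationary point with sample complexity O(ε^{-2.5} log^3(ε^{-1})).
Two time-scale NAC attains an ε-accurate global optimal point with sample complexity O(ε^{-4} log^2(ε^{-1})).
The analysis introduces new techniques for bounding the bias caused by dynamically changing Markovian sampling, for both the linear (critic) and the nonlinear (actor) updates.
The critic's tracking error decays at a rate that depends on the step-size parameters (σ, ν) and includes a logarithmic factor when σ = 1.5ν.
In overall sample complexity, two time-scale AC improves on single-sample nested-loop AC by a factor of O(ε^{-0.5}).
Two time-scale NAC matches the sample complexity of nested-loop NAC, up to a logarithmic factor that comes from the Markovian bias.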
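To get a feel for these orders, the short sketch below evaluates the two stated bounds with every hidden constant set to 1. That is an assumption, so the absolute numbers are meaningless and only the scaling is informative. The helper names are hypothetical. The ε^{-3} order for single-sample nested-loop AC is not stated in this review. It is inferred here from the O(ε^{-0.5}) improvement factor, with logarithmic factors ignored.

```python
import math


def ac_bound(eps):
    """Two time-scale AC order: eps^-2.5 * log^3(1/eps), constants set to 1."""
    return eps ** -2.5 * math.log(1.0 / eps) ** 3


def nac_bound(eps):
    """Two time-scale NAC order: eps^-4 * log^2(1/eps), constants set to 1."""
    return eps ** -4 * math.log(1.0 / eps) ** 2


def poly_growth(exponent, factor=2.0):
    """Growth of the polynomial part eps^-exponent when eps shrinks by `factor`."""
    return factor ** exponent


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        print(f"eps={eps}: AC ~ {ac_bound(eps):.3g}, NAC ~ {nac_bound(eps):.3g}")
    # Halving eps: polynomial part grows by 2^2.5 for AC and 2^4 for NAC.
    print(poly_growth(2.5), poly_growth(4.0))
    # Inferred nested-loop AC order eps^-3 versus eps^-2.5 (logs ignored).
    print(poly_growth(3.0) / poly_growth(2.5))
```

Halving ε multiplies the polynomial part of the AC bound by 2^2.5 (about 5.66) and that of the NAC bound by 2^4 = 16. Under the inferred ε^{-3} order, the same halving widens the gap to nested-loop AC by a further factor of 2^0.5.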

This review was generated by AI and checked by a human editor.