QUICK REVIEW

[Paper Review] Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms

Tengyu Xu, Zhe Wang|arXiv (Cornell University)|May 7, 2020

Reinforcement Learning in Robotics|References: 48|Cited by: 34

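One-sentence summary

This paper gives the first non-asymptotic, finite-sample convergence rates for two time-scale Actor-Critic (AC) and Natural Actor-Critic (NAC) under Markovian sampling: AC converges to an ε-accurate stationary point, and NAC converges to within an ε-neighborhood of the global optimum.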

ABSTRACT

As an important type of reinforcement learning algorithms, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first nested-loop design, actor's one update of policy is followed by an entire loop of critic's updates of the value function, and the finite-sample analysis of such AC and NAC algorithms has recently been well established. The second two time-scale design, in which actor and critic update simultaneously but with different learning rates, has far fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with actor having general policy class approximation. We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point, and two time-scale NAC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-4}\log^2(\epsilon^{-1}))$ to attain an $\epsilon$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.

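Motivation and objectives

Motivate two time-scale AC/NAC as a practical, easier-to-tune alternative to the nested-loop design.
Characterize the non-asymptotic convergence rate (sample complexity) under Markovian sampling.
Handle the dynamically changing Markovian bias and the changing basis functions in the critic and nonlinear actor updates.
Establish that AC converges to an ε-accurate stationary point and NAC to a neighborhood of the global optimum.
Provide techniques for analyzing the bias caused by dynamically changing sampling and basis functions.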

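Proposed method

Model AC/NAC as a two time-scale nonlinear stochastic approximation (SA) with a fast critic and a slow actor.
Use linear SA for the critic, with dynamically changing basis functions and transition kernel.
Use nonlinear SA with diminishing step sizes for the actor, under a general policy parameterization.
Derive bias and drift bounds under dynamic Markovian sampling and changing basis functions.
Prove a critic tracking-error bound and a gradient Lipschitz property to obtain the overall convergence rate.
Obtain explicit sample complexities: O(ε^{-2.5} log^3(ε^{-1})) for AC and O(ε^{-4} log^2(ε^{-1})) for NAC.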
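
The scheme in the list above can be sketched in a few lines. What follows is a minimal illustration under simplifying assumptions, not the paper's algorithm: it uses a tabular softmax actor and a TD(0) value critic with fixed one-hot features, whereas the analysis covers a general policy class and a linear critic whose basis functions change with the policy. The function names, the 0.5 step-size constant, and the assignment of σ to the actor exponent and ν to the critic exponent (with ν < σ, so the critic is the faster iterate) are assumptions made for this sketch.

```python
import numpy as np


def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy in state s."""
    z = theta[s] - theta[s].max()  # shift logits for numerical stability
    p = np.exp(z)
    return p / p.sum()


def two_timescale_ac(P, R, gamma=0.9, sigma=0.9, nu=0.6, steps=20000, seed=0):
    """Illustrative two time-scale actor-critic on a finite MDP.

    P[s, a] is the next-state distribution and R[s, a] the reward.
    Assumption: sigma is the actor exponent and nu the critic exponent.
    """
    rng = np.random.default_rng(seed)
    n_s, n_a = R.shape
    theta = np.zeros((n_s, n_a))  # actor parameters (slow iterate)
    w = np.zeros(n_s)             # critic weights, one-hot features (fast iterate)
    s = 0
    for t in range(steps):
        alpha = 0.5 / (1 + t) ** sigma  # actor step size, decays faster
        beta = 0.5 / (1 + t) ** nu      # critic step size, decays slower
        pi = softmax_policy(theta, s)
        a = rng.choice(n_a, p=pi)
        s_next = rng.choice(n_s, p=P[s, a])
        # TD error computed from one sample of a single Markov trajectory
        delta = R[s, a] + gamma * w[s_next] - w[s]
        # Critic: TD(0) update
        w[s] += beta * delta
        # Actor: stochastic policy-gradient step, using the TD error as an
        # advantage estimate; grad_log is the gradient of log pi(a|s) w.r.t. theta[s]
        grad_log = -pi
        grad_log[a] += 1.0
        theta[s] += alpha * delta * grad_log
        s = s_next  # no reset: samples stay Markovian
    return theta, w


# Toy problem (hypothetical): action 1 always pays 1, action 0 pays 0.
P_toy = np.full((2, 2, 2), 0.5)
R_toy = np.array([[0.0, 1.0], [0.0, 1.0]])
theta_toy, w_toy = two_timescale_ac(P_toy, R_toy, gamma=0.5)
print(softmax_policy(theta_toy, 0), softmax_policy(theta_toy, 1))
```

Both iterates are updated from the same single transition at every step, which is what removes the inner critic loop of the nested-loop design and leaves the two step-size schedules as the main tuning knobs.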

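Results

Research questions

RQ1: What finite-sample convergence rates can two time-scale AC/NAC achieve under Markovian sampling?
RQ2: How do dynamic Markovian sampling and changing basis functions affect the bias and the convergence?
RQ3: With single-sample updates, can two time-scale AC/NAC achieve better sample complexity than existing nested-loop designs?
RQ4: What is the exact sample complexity of reaching an ε-accurate stationary point (AC) or an ε-neighborhood of the global optimum (NAC)?
RQ5: How does a nonlinear policy parameterization affect the convergence analysis of two time-scale nonlinear SA?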

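Key findings

Two time-scale AC converges to an ε-accurate stationary point with sample complexity O(ε^{-2.5} log^3(ε^{-1})).
Two time-scale NAC converges to an ε-neighborhood of the global optimum with sample complexity O(ε^{-4} log^2(ε^{-1})).
The analysis introduces new techniques for bounding the bias that dynamically changing Markovian sampling induces in the linear (critic) and nonlinear (actor) updates.
The critic's tracking error decays at a rate set by the step-size exponents (σ, ν), with a logarithmic factor in the decay rate when σ = 1.5ν.
Two time-scale AC improves on the overall sample complexity of single-sample nested-loop AC by a factor of O(ε^{-0.5}).
Two time-scale NAC matches the sample complexity of nested-loop NAC, up to a logarithmic factor caused by the Markovian bias.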
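
To get a feel for how the two stated rates scale, the short script below evaluates them at a few accuracy levels. It is only an order-of-magnitude illustration: the O(·) notation hides constants, so the printed numbers show growth trends, not actual sample counts. The function names are hypothetical and the natural logarithm is an assumption.

```python
import math


def ac_order(eps):
    """eps^{-2.5} * log^3(1/eps): the two time-scale AC rate, constants dropped."""
    return eps ** -2.5 * math.log(1 / eps) ** 3


def nac_order(eps):
    """eps^{-4} * log^2(1/eps): the two time-scale NAC rate, constants dropped."""
    return eps ** -4 * math.log(1 / eps) ** 2


for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}  AC order ~ {ac_order(eps):.3g}  NAC order ~ {nac_order(eps):.3g}")
```

The two guarantees are not directly comparable: the AC rate is for reaching a stationary point, while the NAC rate is for reaching a neighborhood of the global optimum, which is the stronger result.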


This review was generated by AI and reviewed by human editors.