QUICK REVIEW

[Paper Review] Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Tengyu Xu, Zhe Wang|arXiv (Cornell University)|Apr 27, 2020

Reinforcement Learning in Robotics | References: 56 | Cited by: 23

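One-Sentence Summary

Under Markovian sampling, mini-batch updates, and general policy function approximation, this paper gives the first theoretical proof that actor-critic (AC) and natural actor-critic (NAC) algorithms improve on the sample complexity of their critic-free counterparts. Mini-batch AC improves on policy gradient (PG) by a factor of $\mathcal{O}((1-\gamma)^{-3})$, and mini-batch NAC improves on natural policy gradient (NPG) by a factor of $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, showing that AC/NAC outperform PG/NPG orderwise in infinite-horizon MDPs.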

ABSTRACT

The actor-critic (AC) algorithm is a popular method to find an optimal policy in reinforcement learning. In the infinite horizon scenario, the finite-sample convergence rate for the AC and natural actor-critic (NAC) algorithms has been established recently, but under independent and identically distributed (i.i.d.) sampling and single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data for each iteration, and with actor having general policy class approximation. We show that the overall sample complexity for a mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$. Moreover, the sample complexity of AC and NAC characterized in this work outperforms that of policy gradient (PG) and natural policy gradient (NPG) by a factor of $\mathcal{O}((1-\gamma)^{-3})$ and $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of critic.
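
The two accuracy notions named in the abstract can be written out roughly as follows. This is a sketch in assumed notation, since the abstract does not define its symbols: $w$ stands for the actor parameters, $J(\pi_w)$ for the expected discounted return of policy $\pi_w$, $\pi^*$ for an optimal policy, and $\hat{w}$ for the parameter the algorithm returns. The paper's exact statements may carry additional approximation-error terms on the right-hand side.

```latex
% epsilon-accurate stationary point (the target for mini-batch AC)
\mathbb{E}\left[\left\|\nabla_w J(\pi_{\hat{w}})\right\|_2^2\right] \le \epsilon

% epsilon-accurate globally optimal point (the target for mini-batch NAC)
J(\pi^*) - \mathbb{E}\left[J(\pi_{\hat{w}})\right] \le \epsilon
```

The first condition only asks for a small gradient, while the second bounds the gap to the optimal return, which is why the two guarantees are stated separately for AC and NAC.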

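Research Motivation and Objectives

Close the theoretical gap in the sample complexity analysis of actor-critic (AC) and natural actor-critic (NAC) algorithms under realistic sampling and update schemes.
Characterize the finite-sample convergence rate and sample complexity of AC and NAC under Markovian sampling, mini-batch updates, and general nonlinear policy approximation.
Prove that in infinite-horizon MDPs, AC and NAC achieve orderwise better sample complexity than policy gradient (PG) and natural policy gradient (NPG), respectively.
Resolve a long-standing theoretical question: whether AC/NAC are more sample-efficient than PG/NPG, particularly in the discounted infinite-horizon setting.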

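Proposed Method

Analyzes online AC and NAC algorithms that draw each iteration's mini-batch from a single Markovian sample path.
Develops a new convergence analysis framework that jointly accounts for the critic's approximation error, the actor's approximation error, and the effect of Markovian sampling.
Uses a Lyapunov function $D(w_t)$ to derive a recursive inequality for the expected policy gradient norm, incorporating the bias introduced by value function approximation.
Chooses the step size $\alpha$ to balance convergence against approximation error, and derives a bound on the expected policy value gap $J(\pi^*) - \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[J(\pi_{w_t})]$.
Introduces and bounds the actor approximation error $\zeta^{\text{actor}}_{\text{approx}}$ and the critic approximation error $\zeta^{\text{critic}}_{\text{approx}}$ to quantify the effect of function approximation.
Derives the total sample complexity by tuning the number of iterations $T$, the mini-batch size $B$, and the step size $\alpha$ to balance convergence error against approximation error.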
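
To make the roles of $T$, $B$, and $\alpha$ concrete, here is a minimal Python sketch of a mini-batch actor-critic loop that draws every sample from one unbroken Markov trajectory. It is an illustration under stated assumptions, not the paper's algorithm: the tabular softmax policy, the tabular TD(0) critic, the parameters `critic_steps` and `beta`, and the toy MDP are all hypothetical choices made for brevity, and the sketch samples from the plain chain rather than any modified sampling scheme the paper may use.

```python
import numpy as np


def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy in state s."""
    z = theta[s] - theta[s].max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def minibatch_actor_critic(P, R, gamma=0.9, T=100, B=32, critic_steps=50,
                           alpha=0.5, beta=0.1, seed=0):
    """Toy mini-batch actor-critic on a tabular MDP (illustrative only).

    P[s, a] is a next-state distribution and R[s, a] a reward; both are
    hypothetical inputs. All samples come from one trajectory, never reset.
    """
    rng = np.random.default_rng(seed)
    n_s, n_a = R.shape
    theta = np.zeros((n_s, n_a))  # actor parameters, playing the role of w_t
    v = np.zeros(n_s)             # critic: tabular state-value estimate
    s, samples = 0, 0

    def step(state):
        """Draw one Markovian transition under the current policy."""
        a = rng.choice(n_a, p=softmax_policy(theta, state))
        s_next = rng.choice(n_s, p=P[state, a])
        return a, R[state, a], s_next

    for _ in range(T):
        # Critic: TD(0) updates along the trajectory, policy held fixed.
        for _ in range(critic_steps):
            _, r, s_next = step(s)
            v[s] += beta * (r + gamma * v[s_next] - v[s])
            s = s_next
        # Actor: average B score-function terms weighted by the TD error.
        grad = np.zeros_like(theta)
        for _ in range(B):
            a, r, s_next = step(s)
            delta = r + gamma * v[s_next] - v[s]
            score = -softmax_policy(theta, s)
            score[a] += 1.0  # gradient of log pi(a|s) w.r.t. theta[s]
            grad[s] += delta * score
            s = s_next
        theta += alpha * grad / B
        samples += critic_steps + B
    return theta, v, samples


if __name__ == "__main__":
    # Two states, two actions; action 1 always pays reward 1.
    P = np.full((2, 2, 2), 0.5)
    R = np.array([[0.0, 1.0], [0.0, 1.0]])
    theta, v, samples = minibatch_actor_critic(P, R)
    print("policy in state 0:", softmax_policy(theta, 0), "samples:", samples)
```

The returned `samples` count equals `T * (critic_steps + B)`, which is the quantity a total sample complexity bound controls: more iterations, larger batches, or a longer critic phase all consume samples from the same trajectory.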

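Results

Research Questions

RQ1: Under Markovian sampling and general policy approximation, does the mini-batch AC algorithm achieve better sample complexity than existing AC methods?
RQ2: In infinite-horizon MDPs, can NAC achieve orderwise better sample complexity than NPG, particularly once the discount factor $\gamma$ is taken into account?
RQ3: Is the empirically observed performance advantage of AC and NAC theoretically justified at the level of sample complexity?
RQ4: How does the dependence on $1 - \gamma$ affect the sample complexity of AC and NAC relative to PG and NPG?
RQ5: What is the total sample complexity of mini-batch AC and NAC under Markovian sampling when both the actor and the critic use general nonlinear function approximators?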

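Key Findings

Mini-batch AC reaches an $\epsilon$-accurate stationary point with a sample complexity that improves on the best known AC bound by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$.
Mini-batch NAC reaches an $\epsilon$-accurate globally optimal policy with a sample complexity that improves on the existing NAC bound by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$.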
The total sample complexity of mini-batch AC is $\mathcal{O}\left(\frac{1}{(1-\gamma)^5\epsilon^2}\log(1/\epsilon)\right)$, which improves on the best known PG complexity by a factor of $\mathcal{O}((1-\gamma)^{-3})$.
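The total sample complexity of mini-batch NAC is $\mathcal{O}\left(\frac{1}{(1-\gamma)^4\epsilon^3}\log(1/\epsilon)\right)$, which improves on NPG by a factor of $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$.
This work provides the first theoretical evidence that, in infinite-horizon MDPs, AC and NAC achieve orderwise better sample complexity than PG and NPG, owing to the variance reduction provided by the critic.
The analysis confirms that, by reducing gradient variance, the critic gives AC and NAC a provable and substantial sample complexity advantage over PG and NPG.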
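
To get a feel for the scale of the stated improvement factors, the short Python sketch below evaluates $(1-\gamma)^{-3}$ and $(1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon)$ at a few sample values. The function names and the chosen values of $\gamma$ and $\epsilon$ are illustrative assumptions; big-O bounds hide constants, so the numbers indicate how the gap scales rather than measured speedups.

```python
import math


def ac_over_pg_factor(gamma):
    """Orderwise AC-vs-PG factor (1 - gamma)^-3, constants dropped."""
    return (1.0 - gamma) ** -3


def nac_over_npg_factor(gamma, eps):
    """Orderwise NAC-vs-NPG factor (1 - gamma)^-4 * eps^-1 / log(1/eps)."""
    return (1.0 - gamma) ** -4 / (eps * math.log(1.0 / eps))


if __name__ == "__main__":
    for gamma in (0.9, 0.99):
        print(f"gamma={gamma}: AC/PG ~ {ac_over_pg_factor(gamma):.3g}, "
              f"NAC/NPG ~ {nac_over_npg_factor(gamma, eps=0.01):.3g}")
```

At $\gamma = 0.9$ the AC factor evaluates to about $10^3$, and both factors grow quickly as $\gamma$ approaches 1, which is the regime where the dependence on $1-\gamma$ matters most.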

This review was generated by AI and reviewed by human editors.