QUICK REVIEW

[Paper Review] Convergent Actor-Critic Algorithms Under Off-Policy Training and Function Approximation

Hamid Reza Maei|arXiv (Cornell University)|Feb 21, 2018

Reinforcement Learning in Robotics|References: 13|Cited by: 23

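One-Sentence Summary

This paper proposes the first convergent off-policy Actor-Critic algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, which use state-value function approximation and policy-gradient updates and are guaranteed to converge without additional hyper-parameters. The methods follow the true gradient of the averaged state-value objective, enabling stable learning in continuous or large action spaces, where Q-function approximation breaks down under the curse of dimensionality.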

ABSTRACT

We present the first class of policy-gradient algorithms that work with both state-value and policy function-approximation, and are guaranteed to converge under off-policy training. Our solution targets problems in reinforcement learning where the action representation adds to the-curse-of-dimensionality; that is, with continuous or large action sets, thus making it infeasible to estimate state-action value functions (Q functions). Using state-value functions helps to lift the curse and as a result naturally turn our policy-gradient solution into classical Actor-Critic architecture whose Actor uses state-value function for the update. Our algorithms, Gradient Actor-Critic and Emphatic Actor-Critic, are derived based on the exact gradient of averaged state-value function objective and thus are guaranteed to converge to its optimal solution, while maintaining all the desirable properties of classical Actor-Critic methods with no additional hyper-parameters. To our knowledge, this is the first time that convergent off-policy learning methods have been extended to classical Actor-Critic methods with function approximation.

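Motivation and Objectives

Address the lack of convergent off-policy Actor-Critic methods with function approximation for continuous or large action spaces.
Overcome the limitations of existing off-policy policy-gradient methods, which suffer from high variance or lack convergence guarantees.
Develop an algorithm that supports off-policy learning while keeping the efficiency and modularity of classical Actor-Critic methods.
Guarantee convergence without introducing new hyper-parameters, preserving the simplicity of the classical methods.
Systematically extend on-policy Actor-Critic methods to off-policy learning using state-value functions and exact gradient updates.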

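Proposed Method

Derive the true gradient of the averaged state-value objective to guide the Actor (policy) update, which underpins the convergence guarantee.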
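Use the GTD(λ) and Emphatic-TD(λ) algorithms to estimate the state-value function off-policy with eligibility traces.
Introduce a new eligibility-trace update, expressed through $ f^\lambda_t $ and $ z_t $, that corrects for the off-policy distribution shift.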
Design the Actor update around $ \rho_t \delta_t \psi_t $, where $ \psi_t $ combines importance sampling, eligibility traces, and the policy gradient.
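Keep time and memory complexity linear per step, preserving online and incremental learning.
Prove convergence using martingale and stability arguments under standard function-approximation assumptions.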
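
The critic side of the method list above can be sketched in a few lines. This is a minimal illustration only, assuming the standard published forms of GTD(λ) and Emphatic TD(λ) with linear features, constant γ and λ, and unit interest. The function names, argument names, and step sizes `alpha` and `beta` are hypothetical, and the paper's own trace notation ($ f^\lambda_t $, $ z_t $) and its Actor update are not reproduced here.

```python
import numpy as np


def gtd_lambda_step(theta, w, e, phi, phi_next, reward, rho,
                    gamma, lam, alpha, beta):
    """One GTD(lambda) critic step for a linear value v(s) = theta @ phi(s).

    Assumed standard form; `w` is GTD's auxiliary weight vector and
    `beta` its secondary step size (both illustrative names).
    """
    delta = reward + gamma * (theta @ phi_next) - theta @ phi  # TD error
    e = rho * (phi + gamma * lam * e)  # importance-weighted eligibility trace
    # TD step plus gradient-correction term (the term is zero when lam == 1).
    correction = gamma * (1.0 - lam) * (e @ w) * phi_next
    theta_new = theta + alpha * (delta * e - correction)
    # Auxiliary weights track the expected TD update in feature space.
    w_new = w + beta * (delta * e - (w @ phi) * phi)
    return theta_new, w_new, e


def emphatic_td_step(theta, e, F, phi, phi_next, reward, rho, rho_prev,
                     gamma, lam, alpha, interest=1.0):
    """One Emphatic TD(lambda) critic step (assumed standard form).

    `F` is the follow-on trace; start it at 0.0 so the first step
    yields F = interest.
    """
    delta = reward + gamma * (theta @ phi_next) - theta @ phi  # TD error
    F = rho_prev * gamma * F + interest        # follow-on trace
    M = lam * interest + (1.0 - lam) * F       # emphasis weight
    e = rho * (gamma * lam * e + M * phi)      # emphatic eligibility trace
    theta_new = theta + alpha * (delta * e)
    return theta_new, e, F
```

Both steps touch each weight vector a constant number of times, so the per-step cost is linear in the number of features. Here `rho` stands for the ratio of target-policy to behaviour-policy probability for the action taken, and `rho_prev` for the same ratio at the previous step.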

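Experimental Results

Research Questions

RQ1: Can an off-policy Actor-Critic algorithm be made to converge when both the value function and the policy use function approximation?
RQ2: In continuous or large action spaces, does using a state-value function instead of a Q function remove the curse of dimensionality?
RQ3: In the off-policy setting, can the true gradient of the policy objective be recovered through eligibility traces and importance sampling?
RQ4: Is it possible to keep both convergence and efficiency without introducing hyper-parameters beyond the standard learning rates?
RQ5: How do the proposed methods compare with earlier off-policy Actor-Critic methods such as Off-PAC in gradient direction and convergence?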

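Key Findings

The proposed Gradient Actor-Critic and Emphatic Actor-Critic algorithms are the first Actor-Critic algorithms guaranteed to converge under off-policy training with function approximation.
Both algorithms have linear time and memory complexity per step, so they scale efficiently to large problems.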
The Actor update follows the true gradient of the policy objective, avoiding the direction errors that arise in earlier methods such as Off-PAC.
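When $ \lambda = 1 $, Emphatic-TD(1) and GTD(1) yield the same solution, which corresponds to the MSE-optimal value function; this simplifies the algorithms and removes the need to tune $ \lambda $.
The method retains all the desirable properties of classical Actor-Critic: it is online and incremental and needs no additional hyper-parameters.
A counterexample shows that earlier off-policy Actor-Critic methods such as Off-PAC can update along the wrong gradient direction, whereas the proposed methods avoid this problem.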
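
The $ \lambda = 1 $ finding can be checked at the level of the update rules. The derivation below is a sketch, assuming the standard linear forms of GTD(λ) and Emphatic TD(λ) from the literature with constant $ \gamma $ and unit interest. The critic weights $ \theta_t $, auxiliary weights $ w_t $, features $ \phi_t $, traces $ e_t $, and step size $ \alpha $ are notation assumed here and may differ from the paper's.

```latex
\begin{align*}
\delta_t &= r_{t+1} + \gamma\,\theta_t^\top\phi_{t+1} - \theta_t^\top\phi_t
  && \text{(TD error, shared)} \\[4pt]
e_t &= \rho_t\,(\phi_t + \gamma\lambda\,e_{t-1})
  && \text{(GTD}(\lambda)\text{ trace)} \\
\theta_{t+1} &= \theta_t + \alpha\,\bigl[\delta_t e_t
  - \gamma(1-\lambda)\,(e_t^\top w_t)\,\phi_{t+1}\bigr]
  && \text{(GTD}(\lambda)\text{ update)} \\[4pt]
F_t &= \rho_{t-1}\,\gamma\,F_{t-1} + 1, \qquad
M_t = \lambda + (1-\lambda)\,F_t
  && \text{(emphasis)} \\
e_t &= \rho_t\,(\gamma\lambda\,e_{t-1} + M_t\,\phi_t), \qquad
\theta_{t+1} = \theta_t + \alpha\,\delta_t e_t
  && \text{(Emphatic TD}(\lambda)\text{)} \\[4pt]
\lambda = 1:\quad M_t &= 1, \qquad
\gamma(1-\lambda)\,(e_t^\top w_t)\,\phi_{t+1} = 0 \\
\Longrightarrow\quad e_t &= \rho_t\,(\phi_t + \gamma\,e_{t-1}), \qquad
\theta_{t+1} = \theta_t + \alpha\,\delta_t e_t
  && \text{(both algorithms)}
\end{align*}
```

Under these assumptions the two critics perform identical updates at $ \lambda = 1 $: the GTD correction term vanishes, so the auxiliary weights $ w_t $ no longer affect $ \theta_t $, and the emphasis $ M_t $ collapses to the interest.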


This review was generated by AI and checked by human editors.