QUICK REVIEW

[Paper Review] On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning

Huizhen Yu|arXiv (Cornell University)|Dec 27, 2017

Reinforcement Learning in Robotics | References: 25 | Cited by: 25

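One-Sentence Summary

This paper establishes convergence guarantees for gradient-based off-policy temporal-difference (TD) algorithms with linear function approximation, covering GTD, its mirror-descent variants, and a single-time-scale minimax formulation. It proves almost sure convergence when the eligibility traces are kept bounded by a history-dependent λ scheme, and it identifies the difficulties that unbounded traces pose for convergence under standard diminishing stepsizes.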

ABSTRACT

We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms, which we call GTD and which minimize the mean squared projected Bellman error using stochastic gradient-descent; (ii) their "robustified" biased variants; (iii) their mirror-descent versions which combine the mirror-descent idea with TD learning; and (iv) a single-time-scale version of GTD that solves minimax problems formulated for approximate policy evaluation. We derive convergence results for three types of stepsizes: constant stepsize, slowly diminishing stepsize, as well as the standard type of diminishing stepsize with a square-summable condition. For the first two types of stepsizes, we apply the weak convergence method from stochastic approximation theory to characterize the asymptotic behavior of the algorithms, and for the standard type of stepsize, we analyze the algorithmic behavior with respect to a stronger mode of convergence, almost sure convergence. Our convergence results are for the aforementioned TD algorithms with three general ways of setting their $λ$-parameters: (i) state-dependent $λ$; (ii) a recently proposed scheme of using history-dependent $λ$ to keep the eligibility traces of the algorithms bounded while allowing for relatively large values of $λ$; and (iii) a composite scheme of setting the $λ$-parameters that combines the preceding two schemes and allows a broader class of generalized Bellman operators to be used for approximate policy evaluation with TD methods.

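Research Motivation and Objectives

Establish rigorous convergence results for gradient-based off-policy TD algorithms with linear function approximation in finite-state MDPs.
Analyze how different settings of the λ-parameters (state-dependent, history-dependent, and composite schemes) affect convergence.
Study convergence behavior under three stepsize regimes: constant, slowly diminishing, and standard diminishing (square-summable).
Extend the convergence analysis to constrained and unconstrained versions of single-time-scale GTDa, including biased variants.
Clarify the role of bounded eligibility traces in obtaining almost sure convergence under standard diminishing stepsizes.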
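
The three stepsize regimes listed above can be made concrete with a small worked example. The block below is only an illustration under assumed notation ($\alpha_t$ for the stepsize at iteration $t$, $\beta$ for a decay exponent); the paper's precise conditions on slowly diminishing stepsizes may be more general than the power-law example shown here.

```latex
% Illustrative examples of the three stepsize regimes (assumed notation,
% not the paper's exact conditions).
\begin{align*}
\text{constant:} \quad
  & \alpha_t = \alpha > 0 \ \text{for all } t, \\
\text{slowly diminishing (example):} \quad
  & \alpha_t = t^{-\beta},\ \beta \in (0, 1/2], \quad
    \alpha_t \to 0,\ \ \sum_t \alpha_t = \infty,\ \ \sum_t \alpha_t^2 = \infty, \\
\text{standard diminishing:} \quad
  & \sum_t \alpha_t = \infty,\ \ \sum_t \alpha_t^2 < \infty
    \quad (\text{e.g. } \alpha_t = 1/t).
\end{align*}
```

For power-law stepsizes $\alpha_t = t^{-\beta}$, the sum $\sum_t \alpha_t^2$ is finite exactly when $\beta > 1/2$, which is why exponents in $(0, 1/2]$ fall outside the square-summable regime.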

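Methods

Apply the weak convergence method from stochastic approximation theory to characterize asymptotic behavior under constant and slowly diminishing stepsizes.
Carry out an almost sure convergence analysis, also based on stochastic approximation theory, for the standard diminishing stepsize regime.
Adopt a recently proposed history-dependent λ scheme that keeps the eligibility traces bounded while allowing relatively large values of λ.
Analyze two-time-scale GTD algorithms by decomposing their dynamics into fast and slow time-scale updates.
Apply differential inclusions and ergodicity properties of Markov chains to study the state-trace process and its invariant measure.
Derive the mean ODE for each algorithm and prove convergence to internally chain transitive invariant sets of these ODEs.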
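
To make the idea of a history-dependent λ concrete, here is a minimal Python sketch. It assumes the usual off-policy trace recursion e_t = λ_t γ ρ_{t-1} e_{t-1} + φ_t and a simple scaling rule that picks λ_t from the previous trace so the carried-over term never exceeds a chosen norm bound. The function names, the `cap` parameter, and the specific rule are illustrative assumptions, not necessarily the scheme analyzed in the paper.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def bounded_trace_step(e_prev, phi, rho_prev, gamma, cap):
    """One trace update e_t = lam_t * gamma * rho_prev * e_prev + phi.

    lam_t depends on the history through e_prev: it is 1 when the
    carried-over term already has norm <= cap, and otherwise shrinks
    that term back to norm exactly cap (illustrative rule, an assumption).
    """
    carried_norm = gamma * rho_prev * norm(e_prev)
    lam = 1.0 if carried_norm <= cap else cap / carried_norm
    e = [lam * gamma * rho_prev * x + p for x, p in zip(e_prev, phi)]
    return e, lam


def trace_norms(num_steps, cap, rho=2.0, gamma=0.9):
    """Trace norms along a hypothetical run where the importance
    sampling ratio stays at `rho` and the feature vector is constant."""
    e = [0.0, 0.0]
    phi = [1.0, 0.0]
    norms = []
    for _ in range(num_steps):
        e, _ = bounded_trace_step(e, phi, rho, gamma, cap)
        norms.append(norm(e))
    return norms


if __name__ == "__main__":
    capped = trace_norms(50, cap=5.0)
    uncapped = trace_norms(50, cap=math.inf)  # lam_t = 1 throughout
    print("max capped norm:", max(capped))
    print("final uncapped norm:", uncapped[-1])
```

With `cap=math.inf` the rule reduces to λ_t = 1, and in this hypothetical run the trace grows geometrically because γρ = 1.8 > 1. With a finite cap the trace norm can never exceed `cap` plus the feature norm, while λ_t still equals 1 whenever the trace is small.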

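Results

Research Questions

RQ1: Under what conditions do two-time-scale gradient-based TD algorithms converge with constant or slowly diminishing stepsizes?
RQ2: Can almost sure convergence be established for off-policy TD algorithms when the eligibility traces are unbounded under standard diminishing stepsizes?
RQ3: How does keeping the eligibility traces bounded through a history-dependent λ scheme affect convergence, compared with state-dependent or constant λ?
RQ4: What is the convergence behavior of single-time-scale GTDa when it is formulated as the solution of a minimax problem?
RQ5: Do the biased variants of GTD and mirror-descent TD algorithms retain convergence under the same stepsize and λ schemes?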
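
The minimax formulation mentioned in RQ4 can be illustrated with the standard saddle-point form of the mean squared projected Bellman error. The notation below is assumed for illustration: $A\theta = b$ denotes the projected Bellman equation in parameter space and $C$ is the feature second-moment matrix, taken to be positive definite. The paper's own formulation, for example with constraint sets or regularization, may differ.

```latex
% Saddle-point form of the MSPBE (assumed notation, illustrative only).
\begin{align*}
\mathrm{MSPBE}(\theta)
  &= (b - A\theta)^\top C^{-1} (b - A\theta) \\
  &= \max_{w} \Big\{ 2\, w^\top (b - A\theta) - w^\top C\, w \Big\},
  \qquad w^*(\theta) = C^{-1}(b - A\theta), \\
\min_{\theta} \mathrm{MSPBE}(\theta)
  &= \min_{\theta} \max_{w} \Big\{ 2\, w^\top (b - A\theta) - w^\top C\, w \Big\}.
\end{align*}
```

Substituting the maximizer $w^*(\theta)$ into the bracket gives $2 x^\top C^{-1} x - x^\top C^{-1} x = x^\top C^{-1} x$ with $x = b - A\theta$, which recovers the first line. A single-time-scale method can then update $\theta$ and $w$ with the same stepsize as a stochastic gradient descent-ascent on this objective.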

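Key Findings

For constant and slowly diminishing stepsizes, all analyzed algorithms converge weakly (in distribution) to internally chain transitive invariant sets of their associated mean ODEs.
With eligibility traces kept bounded by the history-dependent λ scheme, two-time-scale GTD and MD-GTD are shown to converge almost surely under standard diminishing stepsizes.
With bounded traces, convergence of single-time-scale GTDa is proved under standard diminishing stepsizes; with state-dependent λ, where the traces can be unbounded, the analysis is limited.
The biased variants of GTD and GTDa are shown to be approximate gradient-based algorithms whose convergence depends on the boundedness of the trace process.
Under standard diminishing stepsizes, convergence of the unconstrained version of single-time-scale GTDa is established as long as the trace process remains bounded.
The analysis confirms that bounded eligibility traces play a key role in obtaining strong convergence guarantees under standard diminishing stepsizes, and suggests that instability may arise when the traces are unbounded.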
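
For readers unfamiliar with the two-time-scale structure behind these results, the following Python sketch shows one textbook GTD2-style update with λ = 0 and importance weighting. It is an assumption-laden illustration: the paper's GTD variants use eligibility traces and may differ in form, and the function names and the stepsize exponents are hypothetical choices.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gtd2_step(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One GTD2-style update (textbook form, lambda = 0).

    theta: main parameters (slow time scale, stepsize alpha)
    w:     auxiliary parameters (fast time scale, stepsize beta)
    rho:   importance sampling ratio for the observed transition
    Both updates use the old values of theta and w.
    """
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)
    new_theta = [
        t + alpha * rho * (p - gamma * pn) * phi_w
        for t, p, pn in zip(theta, phi, phi_next)
    ]
    new_w = [wi + beta * (rho * delta - phi_w) * p for wi, p in zip(w, phi)]
    return new_theta, new_w


def stepsizes(t):
    """Example two-time-scale schedule for t >= 1 (an assumption).

    Both sequences are square-summable with divergent sums, and
    alpha_t / beta_t = t**-0.25 tends to zero, so w moves faster.
    """
    alpha = 1.0 / t
    beta = t ** -0.75
    return alpha, beta


if __name__ == "__main__":
    theta, w = [0.0, 0.0], [0.0, 0.0]
    phi, phi_next = [1.0, 0.0], [0.0, 1.0]
    for t in range(1, 4):
        alpha, beta = stepsizes(t)
        theta, w = gtd2_step(theta, w, phi, phi_next, 1.0, 2.0, 0.9, alpha, beta)
        print(t, theta, w)
```

The design point is the ratio of the two stepsizes: because it vanishes, the auxiliary iterate `w` sees `theta` as nearly fixed, which is what allows the dynamics to be split into fast and slow components.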

This review was generated by AI and reviewed by human editors.