QUICK REVIEW

[Paper Review] Finite-Sample Analysis of Proximal Gradient TD Algorithms

Bo Liu, Ji Liu | arXiv (Cornell University) | Jun 6, 2020

Reinforcement Learning in Robotics | References: 38 | Cited by: 105

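One-Sentence Summary

This paper reformulates gradient TD (GTD) methods as true stochastic gradient algorithms with respect to a primal-dual saddle-point objective, derives finite-sample performance bounds, and proposes accelerated GTD variants based on the mirror-prox method.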

ABSTRACT

In this paper, we analyze the convergence rate of the gradient temporal difference learning (GTD) family of algorithms. Previous analyses of this class of algorithms use ODE techniques to prove asymptotic convergence, and to the best of our knowledge, no finite-sample analysis has been done. Moreover, there has been not much work on finite-sample analysis for convergent off-policy reinforcement learning algorithms. In this paper, we formulate GTD methods as stochastic gradient algorithms w.r.t. a primal-dual saddle-point objective function, and then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Two revised algorithms are also proposed, namely projected GTD2 and GTD2-MP, which offer improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis show that the GTD family of algorithms are indeed comparable to the existing LSTD methods in off-policy learning scenarios.

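Motivation and Objectives

Explain why off-policy TD learning needs true stochastic gradient methods, and address the instability of conventional TD methods.
Derive GTD/GTD2 from a saddle-point formulation so that a finite-sample analysis becomes possible.
Develop revised GTD algorithms that add a projection step to keep the iterates bounded and average the output iterates to improve stability.
Propose accelerated GTD variants based on stochastic mirror-prox to improve the convergence guarantees.
Provide theoretical finite-sample bounds and discuss their implications for off-policy learning.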
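
For readers who want the saddle-point form spelled out, the block below is a sketch in standard GTD notation. The symbols are assumptions, since this summary does not define them: $\phi_t$ and $\phi_t'$ are the feature vectors of the current and next state, $r_t$ is the reward, $\rho_t$ the importance weight, and $\gamma$ the discount factor. Maximizing over $y$ in closed form recovers a squared-norm objective, which is why gradient steps on $L$ are genuine stochastic gradient steps.

```latex
\begin{aligned}
A &= \mathbb{E}\big[\rho_t\,\phi_t(\phi_t - \gamma\phi_t')^\top\big], \qquad
b = \mathbb{E}\big[\rho_t\, r_t\, \phi_t\big], \qquad
C = \mathbb{E}\big[\phi_t\phi_t^\top\big], \\[4pt]
L(\theta, y) &= \langle b - A\theta,\; y\rangle - \tfrac{1}{2}\,\|y\|_M^2,
\qquad \|y\|_M^2 = y^\top M y, \\[4pt]
y^*(\theta) &= \arg\max_y L(\theta, y) = M^{-1}(b - A\theta), \\[4pt]
\max_y L(\theta, y) &= \tfrac{1}{2}\,\|b - A\theta\|_{M^{-1}}^2 .
\end{aligned}
```

With $M = I$ the last line is half the NEU objective $\|b - A\theta\|_2^2$, and with $M = C$ it is half the MSPBE objective $\|b - A\theta\|_{C^{-1}}^2$, so minimizing over $\theta$ after maximizing over $y$ solves the original problem.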

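Proposed Method

Formulate NEU and MSPBE as convex-concave saddle-point problems and show that the GTD family converges to the saddle point.
Introduce a saddle-point objective L(theta, y) with M = I or M = C to unify GTD and GTD2 (M = I corresponds to GTD and the NEU objective, M = C to GTD2 and the MSPBE objective).
Derive true SG updates for GTD/GTD2 from unbiased estimates of A, b, and C, and carry out their finite-sample analysis.
Revise the GTD algorithms by projecting onto a bounded feasible set and outputting the averaged iterates.
Apply stochastic mirror-prox (SMP) to construct GTD2-MP and related accelerated variants.
Give high-probability finite-sample bounds and discuss the on-policy and off-policy settings.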
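
The projection and averaging steps listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' reference implementation: the function names, the L2-ball feasible set, the constant step size, and the uniform average are all choices made here for brevity. The per-sample estimates use the identity b_hat - A_hat theta = rho * delta * phi, where delta is the TD error.

```python
import numpy as np


def project_l2_ball(v, radius):
    """Euclidean projection onto {v : ||v||_2 <= radius} (assumed feasible set)."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)


def projected_gtd2(samples, dim, gamma, alpha, radius=10.0, use_c=True):
    """Projected primal-dual SG updates with iterate averaging (sketch).

    samples: iterable of (phi, reward, phi_next, rho) tuples.
    use_c=True uses M = C (GTD2 / MSPBE); use_c=False uses M = I (GTD / NEU).
    Returns the uniform average of the theta iterates (constant step size
    is assumed, so a step-size-weighted average would be identical).
    """
    theta = np.zeros(dim)
    y = np.zeros(dim)
    theta_sum = np.zeros(dim)
    count = 0
    for phi, reward, phi_next, rho in samples:
        # TD error at the current theta.
        delta = reward + gamma * (phi_next @ theta) - phi @ theta
        # Stochastic estimate of M y: phi phi^T y for M = C, y itself for M = I.
        m_y = phi * (phi @ y) if use_c else y
        # Ascent direction in y: b_hat - A_hat theta - M_hat y.
        dir_y = rho * delta * phi - m_y
        # Descent direction in theta: A_hat^T y.
        dir_theta = rho * (phi - gamma * phi_next) * (phi @ y)
        # Simultaneous update, each followed by projection.
        y, theta = (
            project_l2_ball(y + alpha * dir_y, radius),
            project_l2_ball(theta + alpha * dir_theta, radius),
        )
        theta_sum += theta
        count += 1
    return theta_sum / max(count, 1)
```

As a usage example, a single-state problem with feature 1, reward 1, and gamma = 0.5 has true value 1 / (1 - 0.5) = 2, so `projected_gtd2([(np.array([1.0]), 1.0, np.array([1.0]), 1.0)] * 5000, dim=1, gamma=0.5, alpha=0.1)` should return a value close to 2. The average lags the last iterate slightly because it includes the early transient.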

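Experimental Results

Research Questions

RQ1: Can GTD and GTD2 be derived as true stochastic gradient methods via a saddle-point formulation?
RQ2: What finite-sample performance bounds can be established for GTD/GTD2 in off-policy learning?
RQ3: Can proximal or mirror-prox based updates accelerate convergence and improve the guarantees?
RQ4: How do the finite-sample bounds for gradient TD methods differ between the on-policy and off-policy settings?
RQ5: Which practical revisions (projection, averaging) improve the stability and performance of GTD algorithms?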

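Key Findings

GTD and GTD2 can be viewed as true SG methods with a saddle-point objective, which makes finite-sample analysis possible.
Under standard assumptions and a light-tail condition, finite-sample bounds are derived for the saddle-point formulation.
The revised GTD algorithms with projection keep the iterates bounded and achieve high-probability error bounds.
GTD-MP and GTD2-MP (the mirror-prox based methods) offer accelerated convergence guarantees compared with the original GTD family.
In the on-policy setting, the error bound depends on the number of samples and on the condition number of the problem.
In the off-policy setting, the bound depends on the distance between the behavior policy and the target policy and on the condition number of the covariance matrix.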
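
To make the mirror-prox idea behind GTD2-MP concrete, here is a minimal sketch of one extragradient-style step with Euclidean geometry. It is an assumption-laden illustration rather than the authors' code: the function name and argument layout are hypothetical, projection is omitted, and the notation (features `phi` and `phi_next`, importance weight `rho`, discount `gamma`, step size `alpha`) is standard GTD notation that this summary does not define. The step first moves to an intermediate point, then re-evaluates the directions there and applies them from the original point.

```python
import numpy as np


def gtd2_mp_step(theta, y, phi, reward, phi_next, rho, gamma, alpha):
    """One mirror-prox (extragradient) step for the GTD2 saddle-point problem."""

    def directions(th, yy):
        # TD error at th, then the theta-descent and y-ascent directions
        # built from the single-sample estimates of A, b, and C.
        delta = reward + gamma * (phi_next @ th) - phi @ th
        dir_theta = rho * (phi - gamma * phi_next) * (phi @ yy)
        dir_y = rho * delta * phi - phi * (phi @ yy)
        return dir_theta, dir_y

    # Step 1: look-ahead to an intermediate point.
    dir_theta, dir_y = directions(theta, y)
    theta_mid = theta + alpha * dir_theta
    y_mid = y + alpha * dir_y

    # Step 2: update from the ORIGINAL point using directions at the midpoint.
    dir_theta, dir_y = directions(theta_mid, y_mid)
    return theta + alpha * dir_theta, y + alpha * dir_y
```

On a single-state problem with feature 1, reward 1, and gamma = 0.5, repeatedly applying this step from `theta = y = 0` with `alpha = 0.1` drives `theta` toward the true value 2 and `y` toward 0. Each step costs two direction evaluations instead of one, which is the price paid for the look-ahead.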


This review was generated by AI and checked by human editors.