QUICK REVIEW

[Paper Review] Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples

Tengyu Xu, Shaofeng Zou | arXiv (Cornell University) | Sep 26, 2019

Advanced Bandit Algorithms Research | Cited by 44

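One-Sentence Summary

This paper provides the first non-asymptotic convergence analysis of two time-scale TDC under non-i.i.d. Markovian samples, derives convergence rates for diminishing and constant stepsizes, and proposes a blockwise diminishing stepsize scheme.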

ABSTRACT

Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under identical and independently distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t / t^{2/3}) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwisely diminishing stepsize, and show that it asymptotically converges with an arbitrarily small error at a blockwisely linear convergence rate. Our experiments demonstrate that such an algorithm converges as fast as TDC under constant stepsize, and still enjoys comparable accuracy as TDC under diminishing stepsize.

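Motivation and Objectives

Motivate gradient TD methods with linear function approximation for off-policy value function evaluation.
Characterize the non-asymptotic convergence of two time-scale TDC under Markovian data and diminishing stepsize.
Examine the behavior under constant stepsize and the resulting dynamics of the training error and tracking error.
Propose a blockwise diminishing stepsize scheme that converges quickly with an arbitrarily small training error.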

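Proposed Method

Formulate the MSPBE (mean-square projected Bellman error) objective for off-policy evaluation with importance sampling weights.
Define projected two time-scale stochastic approximation updates for θ (slow) and w (fast).
Derive a non-asymptotic bound under diminishing stepsize, showing a convergence rate as fast as O(log t / t^{2/3}).
Give a non-asymptotic bound under constant stepsize, showing convergence to a neighborhood of θ* with explicit bias and tracking error terms.
Introduce a blockwise diminishing stepsize (Algorithm 1) and prove blockwise linear convergence to arbitrary accuracy.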
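
The projected two time-scale updates for θ and w can be illustrated with a minimal sketch. This is not the paper's code: it assumes the commonly used form of the TDC update with linear features, an ℓ2-ball projection, and an importance weight `rho` that multiplies the TD error term but not the `phi phi^T w` term in the fast update (other write-ups place the weight differently). The names `tdc_step` and `project_l2`, the default `gamma`, and the default `radius` are hypothetical choices made for illustration.

```python
import numpy as np


def project_l2(x, radius):
    """Project x onto the l2 ball of the given radius (assumed projection set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def tdc_step(theta, w, phi, phi_next, reward, rho, alpha, beta,
             gamma=0.95, radius=10.0):
    """One projected TDC update with linear function approximation.

    theta : slow iterate (value-function weights), stepsize alpha
    w     : fast iterate (gradient-correction weights), stepsize beta
    rho   : importance sampling ratio pi(a|s) / pi_b(a|s)
    """
    # TD error under the current value estimate phi^T theta.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)

    # Slow time-scale: TD term plus the gradient correction that uses w.
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)

    # Fast time-scale: w tracks the solution of a least-squares problem in theta.
    w_new = w + beta * (rho * delta * phi - (phi @ w) * phi)

    return project_l2(theta_new, radius), project_l2(w_new, radius)
```

In a full run, `tdc_step` would be called once per transition along a single Markovian trajectory generated by the behavior policy, with `alpha` and `beta` supplied by whichever stepsize rule is under study.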

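Experimental Results

Research Questions

RQ1: What is the non-asymptotic convergence rate of two time-scale TDC under non-i.i.d. Markovian samples and diminishing stepsize?
RQ2: How does a constant stepsize affect the training error and tracking error of two time-scale TDC?
RQ3: Can a blockwise diminishing stepsize scheme converge quickly while keeping the training error small?
RQ4: How does the tracking error affect the slow time-scale training error in two time-scale TD learning?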

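Key Findings

Under diminishing stepsize, θ_t can converge as fast as O((log t)/t^{2/3}), attained when σ = 3ν/2 = 1, where σ and ν are the decay exponents of the slow and fast stepsizes.
Under constant stepsize, θ_t converges exponentially fast to a neighborhood of θ*, whose size is determined by the bias and tracking error terms.
The tracking error z_t = w_t − ψ(θ_t) decays at a different rate from θ_t because the two updates have different condition numbers.
The blockwise diminishing stepsize converges asymptotically at a blockwise linear rate with an arbitrarily small training error, and its sample complexity is slightly better than that of the standard diminishing stepsize.
Experiments show that the blockwise diminishing stepsize converges as fast as the constant stepsize while keeping accuracy close to that of the diminishing stepsize.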
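
To make the two stepsize rules concrete, here is a rough sketch. It assumes polynomial schedules of the form α_t = c_α/(1+t)^σ and β_t = c_β/(1+t)^ν for the diminishing case, with defaults σ = 1 and ν = 2/3 matching the condition σ = 3ν/2 = 1 above. The blockwise rule holds both stepsizes constant inside a block and shrinks them between blocks. The block length, the decay factor, the initial values, and the function names are all hypothetical and are not taken from the paper's Algorithm 1.

```python
def diminishing_stepsizes(t, c_alpha=1.0, c_beta=1.0, sigma=1.0, nu=2.0 / 3.0):
    """Polynomially decaying stepsizes (assumed form) at iteration t >= 0."""
    alpha = c_alpha / (1 + t) ** sigma  # slow time-scale (theta)
    beta = c_beta / (1 + t) ** nu       # fast time-scale (w)
    return alpha, beta


def blockwise_stepsizes(t, alpha0=0.1, beta0=0.5, block_len=1000, decay=0.5):
    """Stepsizes that are constant within a block and shrink between blocks.

    block_len and decay are illustrative values, not the paper's choices.
    """
    block = t // block_len   # index of the block that iteration t falls in
    scale = decay ** block   # geometric shrinkage across blocks
    return alpha0 * scale, beta0 * scale
```

Within each block the iterates behave as under a constant stepsize, and shrinking the stepsize at each block boundary is what lets the final error be driven down further than a single fixed stepsize would allow.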


This review was generated by AI and reviewed by human editors.