QUICK REVIEW

[Paper Review] Finite-Sample Analysis for SARSA with Linear Function Approximation

Shaofeng Zou, Tengyu Xu|arXiv (Cornell University)|Feb 6, 2019

Reinforcement Learning in Robotics | Cited by 65

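One-Sentence Summary

The paper gives the first non-asymptotic finite-sample analysis of on-policy SARSA with linear function approximation under non-i.i.d. data and a time-varying behavior policy, together with a fitted SARSA variant that carries its own finite-sample guarantee.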

ABSTRACT

SARSA is an on-policy algorithm to learn a Markov decision process policy in reinforcement learning. We investigate the SARSA algorithm with linear function approximation under non-i.i.d. data, where a single sample trajectory is available. With a Lipschitz continuous policy improvement operator that is smooth enough, SARSA has been shown to converge asymptotically \cite{perkins2003convergent,melo2008analysis}. However, its non-asymptotic analysis is challenging and remains unsolved due to the non-i.i.d. samples and the fact that the behavior policy changes dynamically with time. In this paper, we develop a novel technique to explicitly characterize the stochastic bias of a type of stochastic approximation procedures with time-varying Markov transition kernels. Our approach enables non-asymptotic convergence analyses of this type of stochastic approximation algorithms, which may be of independent interest. Using our bias characterization technique and a gradient descent type of analysis, we provide the finite-sample analysis on the mean square error of the SARSA algorithm. We then further study a fitted SARSA algorithm, which includes the original SARSA algorithm and its variant in \cite{perkins2003convergent} as special cases. This fitted SARSA algorithm provides a more general framework for iterative on-policy fitted policy iteration, which is more memory and computationally efficient. For this fitted SARSA algorithm, we also provide its finite-sample analysis.
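
To make the setting in the abstract concrete, here is a minimal sketch of SARSA with linear function approximation run along a single trajectory. The tabular arrays `P`, `R`, and `phi`, the softmax policy with temperature `tau`, the `1/(t+1)` step size, and the projection radius are all illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def softmax_policy(theta, phi, state, tau=1.0):
    """Action probabilities from a softmax over Q(s, a) = phi[s, a] @ theta."""
    q = phi[state] @ theta            # shape: (num_actions,)
    z = (q - q.max()) / tau           # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sarsa_linear(P, R, phi, gamma=0.9, num_steps=5000, radius=10.0, seed=0):
    """SARSA with linear function approximation along one trajectory.

    P[s, a] is a distribution over next states, R[s, a] a reward, and
    phi[s, a] a feature vector (all hypothetical inputs for this sketch).
    """
    rng = np.random.default_rng(seed)
    num_states, num_actions, dim = phi.shape
    theta = np.zeros(dim)
    s = 0
    a = rng.choice(num_actions, p=softmax_policy(theta, phi, s))
    for t in range(num_steps):
        s_next = rng.choice(num_states, p=P[s, a])
        # The behavior policy depends on the current theta, so it changes over time.
        a_next = rng.choice(num_actions, p=softmax_policy(theta, phi, s_next))
        td_error = R[s, a] + gamma * phi[s_next, a_next] @ theta - phi[s, a] @ theta
        alpha = 1.0 / (t + 1)         # diminishing step size (illustrative choice)
        theta = theta + alpha * td_error * phi[s, a]
        # Assumed safeguard: project onto a Euclidean ball to keep iterates bounded.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta = theta * (radius / norm)
        s, a = s_next, a_next
    return theta
```

Because each action is drawn from a policy built from the current `theta`, consecutive samples are neither independent nor identically distributed, which is the difficulty the abstract describes.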

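Research Motivation and Goals

Understand the convergence rate of SARSA with linear function approximation when the samples are non-i.i.d. and generated by a time-varying policy.
Propose a new bias characterization for stochastic approximation with time-varying Markov kernels.
Derive finite-sample mean-square-error bounds for SARSA and for a generalized fitted SARSA algorithm.
Show that the fitted SARSA scheme can be more memory- and computation-efficient while retaining its convergence properties.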

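Proposed Method

Introduce a new bias characterization technique for stochastic approximation with time-varying Markov transition kernels.
Analyze SARSA with linear function approximation under a Lipschitz continuous policy improvement operator.
Combine a gradient-descent-style analysis with the bias bound to obtain the finite-sample analysis.
Extend the analysis to a general on-policy fitted SARSA algorithm that inserts TD(0)-based fitting steps between policy improvements.
Derive explicit finite-sample bounds for both diminishing and constant step sizes.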
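
The fitted scheme described in the list above can be sketched as follows: a fixed number of TD(0) updates under a frozen behavior policy, followed by a policy improvement step. This is a sketch under stated assumptions rather than the paper's algorithm; the softmax policy, the constant step size `alpha`, the projection radius, and the names `fit_steps` and `theta_policy` are hypothetical choices made for illustration.

```python
import numpy as np

def softmax_probs(q, tau=1.0):
    """Softmax over a vector of action values."""
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()

def fitted_sarsa(P, R, phi, gamma=0.9, num_rounds=50, fit_steps=20,
                 alpha=0.05, radius=10.0, seed=0):
    """Alternate TD(0) fitting under a frozen policy with policy improvement."""
    rng = np.random.default_rng(seed)
    num_states, num_actions, dim = phi.shape
    theta = np.zeros(dim)          # parameter being fitted
    theta_policy = theta.copy()    # parameter defining the frozen behavior policy
    s = 0
    a = rng.choice(num_actions, p=softmax_probs(phi[s] @ theta_policy))
    for _ in range(num_rounds):
        # Fitting phase: a fixed number of TD(0) updates, not run to convergence.
        for _ in range(fit_steps):
            s_next = rng.choice(num_states, p=P[s, a])
            probs = softmax_probs(phi[s_next] @ theta_policy)
            a_next = rng.choice(num_actions, p=probs)
            td_error = (R[s, a] + gamma * phi[s_next, a_next] @ theta
                        - phi[s, a] @ theta)
            theta = theta + alpha * td_error * phi[s, a]
            # Assumed safeguard: keep the iterate inside a Euclidean ball.
            norm = np.linalg.norm(theta)
            if norm > radius:
                theta = theta * (radius / norm)
            s, a = s_next, a_next
        # Policy improvement: rebuild the behavior policy from the fitted parameter.
        theta_policy = theta.copy()
    return theta
```

With `fit_steps=1` the policy is refreshed after every update, so the sketch reduces to a per-step SARSA update, which is consistent with the abstract's statement that the original algorithm is a special case.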

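Experimental Results

Research Questions

RQ1: Can non-asymptotic convergence guarantees be obtained for on-policy SARSA with linear function approximation under non-i.i.d. data and a time-varying behavior policy?
RQ2: How does the stochastic bias introduced by time-varying Markov kernels affect convergence and its rate?
RQ3: Which finite-sample error bounds can be established for SARSA and for the generalized fitted SARSA algorithm?
RQ4: Does the fitted SARSA framework achieve the same or better sample complexity, and can it offer computational advantages?
RQ5: Which conditions on the policy improvement operator (Lipschitz continuity) ensure convergence and keep the bias controllable?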
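
As an illustration of the Lipschitz condition in RQ5, the sketch below checks numerically how much a softmax policy can change when the parameter moves. The softmax form, the temperature `tau`, and the bound `1 / (2 * tau)` (derived here for feature vectors of norm at most 1) are assumptions of this example, not constants taken from the paper.

```python
import numpy as np

def softmax_policy(theta, phi_state, tau):
    """Softmax policy over Q(s, a) = phi_state[a] @ theta for one state."""
    q = phi_state @ theta
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()

def max_lipschitz_ratio(phi_state, tau, num_pairs=2000, seed=0):
    """Largest observed max_a |pi_1(a) - pi_2(a)| / ||theta_1 - theta_2||."""
    rng = np.random.default_rng(seed)
    dim = phi_state.shape[1]
    worst = 0.0
    for _ in range(num_pairs):
        theta_1 = rng.normal(size=dim)
        theta_2 = theta_1 + rng.normal(scale=0.5, size=dim)
        gap = np.abs(softmax_policy(theta_1, phi_state, tau)
                     - softmax_policy(theta_2, phi_state, tau)).max()
        worst = max(worst, gap / np.linalg.norm(theta_1 - theta_2))
    return worst

# Hypothetical features for one state: 4 actions, 3 dimensions, norm at most 1.
rng = np.random.default_rng(0)
phi_state = rng.normal(size=(4, 3))
phi_state = phi_state / np.maximum(
    1.0, np.linalg.norm(phi_state, axis=1, keepdims=True))

for tau in (0.5, 1.0, 2.0):
    ratio = max_lipschitz_ratio(phi_state, tau)
    print(f"tau={tau}: observed ratio {ratio:.4f}, bound {1 / (2 * tau):.4f}")
```

A larger temperature gives a smaller bound, which is one way to read the abstract's requirement that the policy improvement operator be "smooth enough".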

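Key Findings

SARSA with linear function approximation admits finite-sample mean-square-error bounds under both diminishing and constant step sizes; the bounds quantify the rate of convergence toward the limit point theta*.
With a diminishing step size, the error scales as O(log^3 T / T) for large T, which implies a sample complexity of O(1/delta * log^3(1/delta)) to reach error delta.
With a constant step size, the algorithm converges to a small neighborhood of theta*, provided the step size is small enough and T is large enough.
A general on-policy fitted SARSA algorithm is analyzed and shown to have the same overall O(1/delta * log^3(1/delta)) sample complexity as SARSA; running TD iterations between policy improvements may yield computational savings.
The fitting step can be stopped before it fully converges without harming overall convergence or sample complexity.
A bias characterization for time-varying Markov processes is developed through an auxiliary uniformly ergodic chain, which makes the non-asymptotic analysis possible.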
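
The step from the O(log^3 T / T) rate to the stated sample complexity can be sketched as below. Here K >= 1 is a hypothetical constant absorbing all problem-dependent factors, and the constants 5 and 125 are loose choices made for this illustration, not values from the paper.

```latex
\begin{align*}
\mathbb{E}\,\|\theta_T - \theta^*\|_2^2
  &\le K\,\frac{\log^3 T}{T}
  && \text{(assumed form of the bound, } K \ge 1\text{)} \\
T &= \frac{125K}{\delta}\,\log^3\frac{1}{\delta},
  && 0 < \delta \le \frac{1}{125K} \\
\log T &= \log(125K) + \log\frac{1}{\delta} + 3\log\log\frac{1}{\delta}
  \le 5\log\frac{1}{\delta}
  && \text{(each term is at most } \log\tfrac{1}{\delta}\text{)} \\
K\,\frac{\log^3 T}{T}
  &\le K\cdot\frac{125\,\log^3(1/\delta)}{(125K/\delta)\,\log^3(1/\delta)}
  = \delta
\end{align*}
```

So a trajectory length of order (1/delta) * log^3(1/delta) is enough to push the bound below delta, which matches the complexity quoted in the findings.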


This review was generated by AI and checked by a human editor.