QUICK REVIEW

[Paper Digest] Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Yue Wang, Wei Chen|arXiv (Cornell University)|Dec 1, 2017

Reinforcement Learning in Robotics | Cited by 25

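One-sentence summary

To the authors' knowledge, this paper gives the first finite sample analysis of gradient-based temporal difference (GTD) policy evaluation algorithms in the Markov setting, deriving convergence bounds that depend on the mixing time of the Markov process. The results show that GTD algorithms converge under several step-size choices, and they explain the effectiveness of experience replay as an improvement in the mixing property that speeds up convergence.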

ABSTRACT

In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., expected long-term accumulated reward) of a policy. With a good policy evaluation method, the RL algorithms will estimate the value function more accurately and find a better policy. When the state space is large or continuous, Gradient-based Temporal Difference (GTD) policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of the evaluation data is both time and reward consuming, a clear understanding of the finite sample performance of the policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are i.i.d. generated, previous work provided the finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that the data are generated from Markov processes rather than i.i.d. in RL problems. In this paper, in the realistic Markov setting, we derive the finite sample bounds for the general convex-concave saddle point problems, and hence for the GTD algorithms. We have the following discussions based on our bounds. (1) With variants of step size, GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient. The faster the Markov processes mix, the faster the convergence. (3) We explain that the experience replay trick is effective by improving the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in Markov setting.

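Motivation and objectives

Address the lack of a finite sample analysis of GTD algorithms under the realistic assumption that data are generated by a Markov process.
Derive finite sample bounds for general convex-concave saddle point problems in the Markov setting.
Understand how the step size and the mixing time affect the convergence of GTD algorithms.
Explain why experience replay improves the convergence of GTD algorithms by improving the mixing property.
Provide a theoretical foundation for GTD algorithms in practical reinforcement learning settings with non-i.i.d. data.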

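Proposed method

The authors formulate the GTD algorithms as convex-concave saddle point problems under Markov data.
They derive finite sample bounds for the saddle point problem using concentration inequalities and mixing time analysis.
The mixing time of the underlying Markov process enters the analysis as the key coefficient in the convergence rate.
They analyze step-size schedules to establish convergence under different conditions.
They explain the effectiveness of experience replay by its ability to improve the mixing property of the Markov process.
The theoretical results are derived with tools from stochastic approximation and Markov chain theory.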
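
To make the saddle point view concrete, here is a minimal sketch of GTD2, one member of the GTD family, run on a single Markov trajectory with linear function approximation. The update rule is the standard GTD2 rule. The 3-state chain, rewards, features, step sizes, and all function names are hypothetical choices made for illustration and are not taken from the paper. The projection onto bounded sets that analyses of this kind often assume is omitted.

```python
import numpy as np

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])   # transitions under the evaluated policy
R = np.array([0.0, 0.0, 1.0])        # reward received when leaving each state
PHI = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])         # one feature row per state
GAMMA = 0.9


def td_fixed_point(P, R, PHI, gamma):
    """Solve A theta = b, with expectations under the stationary distribution."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()                       # stationary distribution of P
    D = np.diag(pi)
    A = PHI.T @ D @ (PHI - gamma * P @ PHI)  # A = E[phi (phi - gamma phi')^T]
    b = PHI.T @ D @ R                        # b = E[r phi]
    return np.linalg.solve(A, b)


def gtd2_markov(P, R, PHI, gamma, alpha, beta, steps, seed=0):
    """GTD2 on one trajectory: samples are Markov, not i.i.d.

    GTD2 is stochastic primal-dual gradient on the saddle point problem
        min_theta max_y  <b - A theta, y> - 0.5 * y^T C y,   C = E[phi phi^T].
    """
    rng = np.random.default_rng(seed)
    n_states, d = PHI.shape
    theta = np.zeros(d)   # primal variable: value function weights
    y = np.zeros(d)       # dual variable
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n_states, p=P[s])   # next state depends on current
        phi, phi_next = PHI[s], PHI[s_next]
        delta = R[s] + gamma * phi_next @ theta - phi @ theta   # TD error
        phi_y = phi @ y                         # uses y before its update
        theta = theta + alpha * (phi - gamma * phi_next) * phi_y
        y = y + beta * (delta - phi_y) * phi
        s = s_next
    return theta
```

Calling `gtd2_markov(P, R, PHI, GAMMA, alpha=0.05, beta=0.05, steps=50_000)` and comparing the result with `td_fixed_point(P, R, PHI, GAMMA)` gives a rough sense of how close a finite trajectory gets to the fixed point. Constant step sizes are used here only for simplicity.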

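Results

Research questions

RQ1: How do the finite sample convergence bounds of GTD algorithms behave when data are generated by a Markov process?
RQ2: What role does the mixing time of the Markov process play in determining the convergence rate of GTD algorithms?
RQ3: How does the choice of step size affect the convergence of GTD algorithms in the Markov setting?
RQ4: Why does experience replay improve the performance of GTD algorithms?
RQ5: Can finite sample bounds for GTD algorithms be derived rigorously when the data are not i.i.d.?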

Key findings

GTD algorithms converge in the Markov setting under several step-size schedules, extending earlier results that assumed i.i.d. data.
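The convergence rate of GTD algorithms is determined by the step size, with the mixing time of the underlying Markov process entering the bound as a coefficient.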
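Faster-mixing Markov processes yield faster convergence of GTD algorithms.
Experience replay improves convergence by improving the mixing property of the Markov chain that generates the data.
To the authors' knowledge, this work is the first finite sample analysis of GTD algorithms in the Markov setting, filling a key theoretical gap.
The bounds are obtained by converting GTD into a convex-concave saddle point problem and analyzing its finite-time behavior under Markov sampling.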
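
Since the findings turn on mixing time, a small numerical illustration may help. The sketch below computes, for a finite chain, the worst-case total variation distance to the stationary distribution after t steps, and the first t at which that distance drops below a threshold. The symmetric two-state chain, the threshold `eps`, and the function names are assumptions made for illustration. This digest does not say how the paper models experience replay, so the sketch only shows the quantity "mixing time" and does not model replay.

```python
import numpy as np


def stationary(P):
    """Stationary distribution: left eigenvector of P for eigenvalue 1."""
    evals, evecs = np.linalg.eig(P.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()


def worst_case_tv(P, t):
    """Max over start states of the TV distance between P^t(s, .) and pi."""
    pi = stationary(P)
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()


def mixing_time(P, eps=0.01, t_max=10_000):
    """Smallest t with worst-case TV distance at most eps."""
    for t in range(t_max + 1):
        if worst_case_tv(P, t) <= eps:
            return t
    raise ValueError("chain did not mix within t_max steps")


def two_state_chain(p):
    """Hypothetical symmetric chain that switches state with probability p."""
    return np.array([[1.0 - p, p],
                     [p, 1.0 - p]])


slow = mixing_time(two_state_chain(0.05))  # sticky chain, correlated samples
fast = mixing_time(two_state_chain(0.5))   # next state independent of current
```

A chain that rarely switches state needs many more steps before its samples look like draws from the stationary distribution, which is the sense in which a larger mixing time loosens a bound that carries it as a coefficient.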


This digest was generated by AI and reviewed by a human editor.