QUICK REVIEW

[Paper Review] True Online Temporal-Difference Learning

Harm van Seijen, A. Rupam Mahmood|arXiv (Cornell University)|Dec 13, 2015

Reinforcement Learning in Robotics|References: 18|Cited by: 56

One-Sentence Summary

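This paper analyzes and empirically evaluates true online temporal-difference learning: true online TD(λ) and true online Sarsa(λ), introduced in van Seijen & Sutton (2014), apply two key modifications to the standard TD(λ) update so that the algorithm stays exactly equivalent to the forward view at every time step. Empirical results on random Markov reward processes (MRPs), a myoelectric prosthetic arm, and a domain from the Arcade Learning Environment (Atari) suggest that the true online methods often learn faster than the regular methods and never slower, while also removing the need to choose between accumulating and replacing eligibility traces.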

ABSTRACT

The temporal-difference methods TD($λ$) and Sarsa($λ$) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD($λ$) and true online Sarsa($λ$), respectively (van Seijen & Sutton, 2014). These new versions maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD($λ$)/Sarsa($λ$) with regular TD($λ$)/Sarsa($λ$) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods are often better, but never worse than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. Besides the empirical results, we provide an in-depth analysis of the theory behind true online temporal-difference learning. In addition, we show that new true online temporal-difference methods can be derived by making changes to the online forward view and then rewriting the update equations.

Motivation and Objectives

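Address the theoretical and empirical limitations of standard TD(λ) and Sarsa(λ), which only approximate the forward view for small step-sizes.
Develop a method that stays exactly equivalent to the forward view at every time step, giving full control over the bias-variance trade-off.
Evaluate empirically whether the improved theoretical properties of true online TD(λ) translate into better performance across diverse domains and function approximation settings.
Show that the method removes the need to choose between accumulating and replacing eligibility traces, which simplifies practical use.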

Proposed Method

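Define a new online forward view based on the interim λ-return, a λ-return truncated at the current time horizon that extends as more data arrives, which makes online updates possible.
Derive the true online TD(λ) update equations directly from this online forward view, so that exact equivalence holds at every step.
Modify the standard TD(λ) weight update by adding a correction term based on the difference between the value estimates of the current feature vector under the current and the previous weight vectors, used together with the eligibility trace.
Maintain the eligibility trace with the recursive update $\mathbf{e}_t = \gamma\lambda\mathbf{e}_{t-1} + \bm{\phi}_t - \alpha\gamma\lambda(\mathbf{e}_{t-1}^\top\bm{\phi}_t)\bm{\phi}_t$, which allows exact online computation.
Apply the same derivation to obtain true online Sarsa(λ), preserving forward-view equivalence in the on-policy control setting.
Use linear function approximation with tabular, binary, and non-binary features to evaluate the methods across representation types.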
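The prediction update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the argument layout, and the `v_old` bookkeeping are assumptions made here. The trace update is the one given in the text. The weight update is not spelled out in this review, so the sketch assumes the form $\theta_{t+1} = \theta_t + \alpha\delta_t\mathbf{e}_t + \alpha(\theta_t^\top\phi_t - \theta_{t-1}^\top\phi_t)(\mathbf{e}_t - \phi_t)$, which matches the description of the correction term and should be checked against the paper.

```python
import numpy as np


def true_online_td_lambda(features, rewards, alpha, gamma, lam, theta=None):
    """Run true online TD(lambda) over one trajectory with linear values.

    features: array of shape (T + 1, d); features[t] is phi_t. For an
              episodic task, use an all-zero row for the terminal state.
    rewards:  array of shape (T,); rewards[t] is the reward R_{t+1}
              received on the transition from step t to step t + 1.
    Returns the weight vector after T updates.
    """
    features = np.asarray(features, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    d = features.shape[1]
    theta = np.zeros(d) if theta is None else np.array(theta, dtype=float)
    e = np.zeros(d)   # eligibility trace
    v_old = 0.0       # value of the current state under the previous weights

    for t in range(len(rewards)):
        phi, phi_next = features[t], features[t + 1]
        v = theta @ phi
        v_next = theta @ phi_next
        delta = rewards[t] + gamma * v_next - v   # ordinary TD error
        # Trace update from the text; (e @ phi) uses the previous trace.
        e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi
        # Weight update with the correction term (assumed form, see lead-in).
        theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
        v_old = v_next
    return theta


# Tiny tabular chain: state 0 -> state 1 -> terminal, reward 1 on the last step.
chain_features = [[1, 0], [0, 1], [0, 0]]
chain_rewards = [0, 1]
print(true_online_td_lambda(chain_features, chain_rewards,
                            alpha=0.5, gamma=1.0, lam=1.0))
```

Storing `v_old` avoids keeping a copy of the previous weight vector: the value of the next state is computed before the weights change, so it equals the previous-weights estimate needed one step later.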

Experimental Results

Research Questions

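RQ1: Does true online TD(λ) learn faster than standard TD(λ) across diverse environments and function approximation schemes?
RQ2: Does true online TD(λ) stay exactly equivalent to the forward view when the step-size is not infinitesimal?
RQ3: Does the method, as the authors claim, remove the need to choose between accumulating and replacing eligibility traces?
RQ4: How do the learning speed and performance of true online Sarsa(λ) compare with standard Sarsa(λ) on control tasks?
RQ5: Can the proposed online forward view framework be used to derive other true online temporal-difference algorithms?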

Key Findings

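Across the tested domains (random MRPs, the myoelectric prosthetic arm, and the Atari domain), the learning speed of the true online methods was often better than that of the regular methods.
In every tested environment and representation (tabular, binary, and non-binary features), true online TD(λ) was never worse than standard TD(λ).
Even with moderate step-sizes, the method remains exactly equivalent to the forward view, whereas standard TD(λ) only approximates the forward view for small step-sizes.
On the control task from the Arcade Learning Environment, true online Sarsa(λ) performed at least as well as standard Sarsa(λ) with either accumulating or replacing traces.
The algorithm removes the need to choose between accumulating and replacing traces, because its single trace update is derived directly from the online forward view.
Results on random MRPs with different parameter settings (k=10, k=100, b=3, b=10, σ=0.1, σ=0) support a consistent advantage for the true online methods across noise and complexity levels.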
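The exact-equivalence finding can be checked numerically on a short stream of synthetic data. The sketch below is an illustration under stated assumptions, not the paper's experimental code: all function names are hypothetical, the data are random non-binary features and rewards rather than any domain from the paper, and the online forward view is assumed to restart from the initial weights at each horizon h and to apply, for t = 0, ..., h-1, an update toward the interim λ-return whose n-step returns bootstrap from the weights θ_{t+n-1}.

```python
import numpy as np


def true_online_td_weights(phis, rewards, alpha, gamma, lam):
    """Return [theta_0, ..., theta_T] produced by true online TD(lambda)."""
    d = phis.shape[1]
    theta, e, v_old = np.zeros(d), np.zeros(d), 0.0
    history = [theta.copy()]
    for t in range(len(rewards)):
        phi, phi_next = phis[t], phis[t + 1]
        v, v_next = theta @ phi, theta @ phi_next
        delta = rewards[t] + gamma * v_next - v
        e = gamma * lam * e + phi - alpha * gamma * lam * (e @ phi) * phi
        theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
        v_old = v_next
        history.append(theta.copy())
    return history


def interim_lambda_return(t, h, phis, rewards, online, gamma, lam):
    """Interim lambda-return from time t, using data up to horizon h only.

    online[j] is the online weight vector theta_j; the n-step return from
    time t bootstraps from state t + n with the weights theta_{t+n-1}.
    rewards[i] holds R_{i+1}.
    """
    def n_step(n):
        g = sum(gamma ** (k - 1) * rewards[t + k - 1] for k in range(1, n + 1))
        return g + gamma ** n * (online[t + n - 1] @ phis[t + n])

    g = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, h - t))
    return g + lam ** (h - t - 1) * n_step(h - t)


def forward_view_weights(phis, rewards, alpha, gamma, lam):
    """Return the weights at each horizon h = 0, ..., T under the online forward view."""
    d = phis.shape[1]
    online = [np.zeros(d)]
    for h in range(1, len(rewards) + 1):
        theta = np.zeros(d)   # every horizon restarts from the initial weights
        for t in range(h):
            target = interim_lambda_return(t, h, phis, rewards, online, gamma, lam)
            theta = theta + alpha * (target - theta @ phis[t]) * phis[t]
        online.append(theta)
    return online


rng = np.random.default_rng(0)
T, d = 8, 4
phis = rng.normal(size=(T + 1, d))   # synthetic non-binary features
rewards = rng.normal(size=T)         # synthetic rewards
alpha, gamma, lam = 0.1, 0.9, 0.8    # a step-size that is not close to zero

backward = true_online_td_weights(phis, rewards, alpha, gamma, lam)
forward = forward_view_weights(phis, rewards, alpha, gamma, lam)
max_gap = max(np.abs(b - f).max() for b, f in zip(backward, forward))
print(f"largest weight difference: {max_gap:.2e}")
```

The forward view recomputes every update from scratch at each horizon, so its cost grows with the length of the stream, while the true online update does a constant amount of work per step. If the two agree up to floating-point rounding, that is the equivalence the findings describe.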

This review was generated by AI and reviewed by a human editor.