QUICK REVIEW

[Paper Review] Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

Ruosong Wang, Simon S. Du|arXiv (Cornell University)|May 1, 2020

Reinforcement Learning in Robotics|References: 36|Cited by: 23

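One-Sentence Summary

This paper resolves a COLT 2018 open problem by proving that the sample complexity of tabular episodic reinforcement learning scales only logarithmically, rather than polynomially, with the planning horizon H. The authors introduce the Online Trajectory Synthesis algorithm and an ε-net construction for optimal policies, showing that when the cumulative reward per episode is normalized to [0,1], long horizon RL is no more difficult than short horizon RL in a minimax sense.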

ABSTRACT

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to that of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.

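Research Motivation and Objectives

Resolve the COLT 2018 open problem of whether the sample complexity of long horizon RL must grow polynomially with the planning horizon H.
Challenge the prevailing conjecture that long horizon RL is inherently harder than short horizon RL because of a polynomial dependence on H.
Design a provably efficient algorithm for tabular episodic RL whose sample complexity grows only logarithmically with H.
Construct an ε-net of optimal policies whose log-covering number grows only logarithmically with H, enabling efficient policy evaluation.
Show that, when the cumulative reward is normalized to [0,1], long horizon RL is no harder in a minimax sense than contextual bandits (H=1).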
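
For reference, the normalized setting behind these objectives can be written out as follows. The notation ($r_h$ for the reward at step $h$, $V^{\pi}$ for the value of policy $\pi$, $\hat{\pi}$ for the returned policy, $\pi^*$ for an optimal policy) is assumed here for illustration and may differ from the paper's own.

```latex
% Rewards are non-negative and sum to at most 1 within an episode of length H.
r_h \ge 0, \qquad \sum_{h=1}^{H} r_h \le 1

% Value of a policy: expected cumulative reward per episode, so V^{\pi} \in [0, 1].
V^{\pi} = \mathbb{E}\left[ \sum_{h=1}^{H} r_h \,\middle|\, \pi \right]

% The returned policy \hat{\pi} is \varepsilon-optimal if
V^{\hat{\pi}} \ge V^{\pi^*} - \varepsilon
```

Under this normalization the target accuracy ε means the same thing for every H, which is what makes sample complexities at different horizons comparable.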

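Proposed Method

Propose the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given class with sample complexity that scales with the log-covering number of the policy class.
Construct an ε-net for the set of optimal policies whose log-covering number grows only logarithmically with the planning horizon H.
Adopt a normalized reward setting in which the cumulative reward of each trajectory lies in [0,1], allowing a fair comparison across horizons.
Apply concentration inequalities and high-probability bounds to ensure that each policy's estimated value is within ε of its true value with high probability.
Prove that the algorithm returns an ε-optimal policy with probability at least 1−δ using poly(|S|, |A|, log H, 1/ε, log(1/δ)) trajectories.
Exploit the structure of episodic MDPs and the non-negativity of rewards to bound the estimation error and ensure convergence to a near-optimal policy.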
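
The link between a policy class's log-covering number and the number of trajectories can be illustrated with the textbook Hoeffding-plus-union-bound calculation. The sketch below is not the Online Trajectory Synthesis algorithm. It is a simplified stand-in that assumes every policy in a finite class is evaluated from independent episode returns lying in [0,1]. The function name `episodes_per_policy` and its parameters are hypothetical.

```python
import math


def episodes_per_policy(log_covering_number: float, eps: float, delta: float) -> int:
    """Episodes per policy so that all value estimates are eps-accurate.

    Assumption (not taken from the paper): each policy's value is estimated
    by the mean of i.i.d. episode returns in [0, 1]. Hoeffding's inequality
    plus a union bound over N = exp(log_covering_number) policies gives
        n >= (log N + log(2 / delta)) / (2 * eps^2),
    which makes all N estimates eps-accurate with probability >= 1 - delta.
    """
    if not (0 < eps <= 1 and 0 < delta < 1):
        raise ValueError("need 0 < eps <= 1 and 0 < delta < 1")
    if log_covering_number < 0:
        raise ValueError("log_covering_number must be non-negative")
    numerator = log_covering_number + math.log(2 / delta)
    return math.ceil(numerator / (2 * eps**2))


if __name__ == "__main__":
    # The requirement grows with log N, not with N itself.
    for n_policies in (10, 10**3, 10**6):
        n = episodes_per_policy(math.log(n_policies), eps=0.1, delta=0.05)
        print(f"{n_policies:>8} policies -> {n} episodes per policy")
```

Because the requirement depends on the class size only through its logarithm, a policy class whose log-covering number is logarithmic in H needs, under these simplified assumptions, a number of evaluations per policy that is also only logarithmic in H.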

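Results

Research Questions

RQ1: Does the sample complexity of tabular episodic RL grow polynomially with the planning horizon H, as conjectured by Jiang and Agarwal (2018)?
RQ2: Can a provably efficient algorithm for long horizon RL be designed whose sample complexity grows only logarithmically with H?
RQ3: When the cumulative reward is normalized to [0,1], is there an essential difference in difficulty between long horizon RL and short horizon RL (such as contextual bandits)?
RQ4: Can an ε-net of optimal policies be constructed whose log-covering number grows logarithmically with H?
RQ5: Is it possible to achieve a minimax-optimal sample complexity for tabular episodic RL that is polylogarithmic in H, with no polynomial dependence on H?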

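Key Findings

The sample complexity of the proposed Online Trajectory Synthesis algorithm grows only logarithmically, not polynomially, with the planning horizon H.
The paper refutes the prevailing conjecture that long horizon RL is inherently harder than short horizon RL because of a polynomial dependence on H.
The log-covering number of the ε-net of optimal policies grows logarithmically with H, which supports efficient policy evaluation.
The algorithm returns an ε-optimal policy with probability at least 1−δ using poly(|S|, |A|, log H, 1/ε, log(1/δ)) trajectories.
The result shows that, in a minimax sense, long horizon RL is no more difficult than short horizon RL when the cumulative reward is normalized to [0,1].
The authors conjecture that the minimax-optimal sample complexity of tabular episodic RL is Õ(|S||A|poly(log H)/ε²), which would mean there is no horizon-dependent difficulty.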
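
To get a feel for what a polylogarithmic dependence on H means in practice, the following sketch compares the shape of the conjectured rate with a hypothetical rate that is polynomial in H. Constants and hidden logarithmic factors are dropped, and the exponents `log_power` and `h_power` are illustrative assumptions, not values stated in the source.

```python
import math


def polylog_rate(n_states, n_actions, horizon, eps, log_power=1.0):
    """Shape of |S||A| (log H)^k / eps^2, with constants dropped.

    log_power (k) is a hypothetical exponent chosen for illustration.
    """
    if horizon < 2:
        raise ValueError("horizon must be at least 2 so that log H > 0")
    return n_states * n_actions * math.log(horizon) ** log_power / eps**2


def polynomial_rate(n_states, n_actions, horizon, eps, h_power=1.0):
    """Shape of |S||A| H^p / eps^2, a hypothetical polynomial-in-H rate."""
    return n_states * n_actions * horizon**h_power / eps**2


if __name__ == "__main__":
    # Same small MDP and accuracy, increasing horizon.
    for horizon in (10, 100, 10_000):
        log_n = polylog_rate(5, 2, horizon, eps=0.1)
        poly_n = polynomial_rate(5, 2, horizon, eps=0.1)
        print(f"H={horizon:>6}: polylog ~ {log_n:,.0f}, polynomial ~ {poly_n:,.0f}")
```

With both exponents set to 1, increasing H from 10 to 10,000 multiplies the polylogarithmic rate by 4 and the polynomial rate by 1,000.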

This review was generated by AI and reviewed by a human editor.