QUICK REVIEW

[Paper Digest] Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Andrea Zanette, Emma Brunskill|arXiv (Cornell University)|Jan 1, 2019

Advanced Bandit Algorithms Research | References: 26 | Cited by: 66

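One-Sentence Summary

The paper proposes Euler, an episodic finite-horizon RL algorithm whose problem-dependent regret bound scales with the maximum conditional variance of the next-state value, while still matching the worst-case bound in general.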

ABSTRACT

Strong worst-case performance bounds for episodic reinforcement learning exist, but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this, we derive an algorithm for finite-horizon discrete MDPs and associated analysis that both yields state-of-the-art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the next-state value functions. An important benefit of our algorithm is that it does not require a priori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question (Jiang and Agarwal, 2018) about episodic MDPs with a constant upper bound on the sum of rewards, providing a regret bound with no $H$-dependence in the leading term that scales as a polynomial function of the number of episodes.

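Motivation and Objectives

Motivate the need for problem-dependent regret bounds in reinforcement learning, so that what makes a problem hard can be understood beyond worst-case analysis.
Propose an adaptive exploration algorithm (Euler) built on variance-aware bonuses that needs no prior knowledge of the environment.
Derive high-probability regret bounds that depend on the environmental variance $Q^*$, and show horizon-independent behavior in the leading term in certain bounded-reward settings.
Show that the approach yields tighter bounds in domains with a small environmental norm, and help address an open learning-theory question.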
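
For orientation, the environmental variance mentioned above can be written out as follows. This is a sketch of the definition as we read it from the paper, not a statement taken from this digest; the symbols $R(s,a)$ (random reward), $p(s,a)$ (transition distribution), and $V^*_{t+1}$ (optimal value at step $t+1$) are notational assumptions.

```latex
% Assumed notation: R(s,a) is the random reward, p(s,a) the transition
% distribution, V^*_{t+1} the optimal value function at step t+1.
Q^* \;=\; \max_{s,a,t}\Big( \operatorname{Var} R(s,a)
      \;+\; \operatorname{Var}_{s' \sim p(s,a)} V^*_{t+1}(s') \Big)
```

Under this reading, $Q^*$ is small whenever rewards and next-state values vary little at every state-action pair, for example in nearly deterministic environments.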

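Proposed Method

Introduce Euler, an episodic exploration algorithm for finite-horizon MDPs that maintains both upper and lower bounds on the value function.
Following optimism in the face of uncertainty, use Bernstein-type bonuses based on the empirical variance of the next-state values.
Add a correction bonus that accounts for uncertainty in the value function, so that optimism is guaranteed.
Decompose the regret into terms for reward estimation, transition-dynamics estimation and optimism, and lower-order terms.
Bound the dominant exploration term by the problem-dependent quantity $Q^*$ and relate it to the maximum return $G$.
Prove a worst-case bound that matches the known $\tilde{O}(\sqrt{HSAT})$ rate in the leading term.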
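
The variance-aware bonus described above can be illustrated with a minimal sketch. It is not the paper's exact bonus: the function names, the constants 2 and 1/3, and the `log(1/delta)` term are assumptions chosen for illustration, and the correction bonus for value-function uncertainty is left out.

```python
import math


def empirical_variance(p_hat, v_next):
    """Variance of next-state values under an empirical transition distribution."""
    mean = sum(p * v for p, v in zip(p_hat, v_next))
    return sum(p * (v - mean) ** 2 for p, v in zip(p_hat, v_next))


def bernstein_bonus(p_hat, v_next, n_visits, horizon, delta=0.05):
    """Illustrative Bernstein-type bonus for one (state, action) pair.

    The constants and the log term are placeholders, not the paper's choices.
    """
    if n_visits == 0:
        return float(horizon)  # no data yet: fall back to the trivial bound H
    log_term = math.log(1.0 / delta)
    variance = empirical_variance(p_hat, v_next)
    variance_part = math.sqrt(2.0 * variance * log_term / n_visits)
    range_part = horizon * log_term / (3.0 * n_visits)
    # A value estimate never needs a bonus larger than the horizon.
    return min(float(horizon), variance_part + range_part)


# Same visit count, different next-state variance.
deterministic = bernstein_bonus([1.0, 0.0], [3.0, 7.0], n_visits=100, horizon=10)
stochastic = bernstein_bonus([0.5, 0.5], [0.0, 10.0], n_visits=100, horizon=10)
print(deterministic, stochastic)
```

The point of the sketch is that the bonus shrinks when the next-state values have low empirical variance, which is how a bound of this kind can adapt to easy environments without being told in advance that they are easy.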

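Experimental Results

Research Questions

RQ1: Can we obtain regret bounds for episodic finite-horizon MDPs that depend on problem structure rather than on the worst case alone?
RQ2: Can exploration bonuses based on empirical Bernstein inequalities and value-function uncertainty yield tighter, environment-dependent regret bounds without prior domain knowledge?
RQ3: How do the horizon and the environmental norm affect regret bounds in finite-horizon RL?
RQ4: Can the proposed algorithm help resolve the open question about horizon dependence in episodic MDPs with bounded total reward?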

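Key Findings

With high probability, Euler achieves a problem-dependent regret upper bound of the form $\tilde{O}\big(\sqrt{Q^* SAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\big)$.
A second bound, $\tilde{O}\big(\sqrt{(G^2/H)\,SAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\big)$, is the tighter of the two whenever $G^2/H < Q^*$, for example when the maximum return $G$ is small.
Corollaries show horizon-independent behavior in the leading term in certain bounded-reward settings, with a leading term that matches the minimax bound.
Corollary 1.1 gives the worst-case bound $\tilde{O}\big(\sqrt{HSAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\big)$.
Corollary 1.2 gives a bound in terms of the range of successor-state values, $\Phi_{succ}$, rather than the full range of $V^*$, and it requires knowledge of neither $\Phi$ nor the environmental norm.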
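
To see how the leading terms of these bounds compare, the following sketch evaluates them numerically, ignoring logarithmic factors and the shared lower-order term. It assumes the common convention $T = KH$ (total time steps over $K$ episodes of length $H$), which this digest does not state; the function and key names are hypothetical.

```python
import math


def leading_terms(q_star, g, horizon, n_states, n_actions, n_episodes):
    """Leading terms of the three bounds, without log factors or constants."""
    t = n_episodes * horizon  # assumption: T = K * H
    sat = n_states * n_actions * t
    return {
        "variance": math.sqrt(q_star * sat),           # sqrt(Q* S A T)
        "return": math.sqrt(g ** 2 * sat / horizon),   # sqrt(G^2 / H * S A T)
        "worst_case": math.sqrt(horizon * sat),        # sqrt(H S A T)
    }


# With total reward bounded by G = 1, the "return" term does not grow with H.
short = leading_terms(q_star=1.0, g=1.0, horizon=10,
                      n_states=5, n_actions=2, n_episodes=1000)
long = leading_terms(q_star=1.0, g=1.0, horizon=100,
                     n_states=5, n_actions=2, n_episodes=1000)
print(short["return"], long["return"])
print(short["worst_case"], long["worst_case"])
```

Under the $T = KH$ assumption the return-based term reduces to $\sqrt{G^2 SAK}$, which depends on the number of episodes but not on the horizon. Setting $G = H$ makes it coincide with the worst-case term.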

This digest was generated by AI and reviewed by human editors.