QUICK REVIEW

[Paper Review] Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Christoph Dann, Tor Lattimore|arXiv (Cornell University)|Mar 22, 2017

Advanced Bandit Algorithms Research|References: 23|Cited by: 60

One-Sentence Summary

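This paper proposes Uniform-PAC, a framework that unifies PAC and regret guarantees in episodic RL, and presents UBEV, an optimistic algorithm that uses time-uniform law-of-the-iterated-logarithm (LIL) confidence bounds to achieve near-optimal Uniform-PAC and regret bounds.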

ABSTRACT

Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.

Motivation and Objectives

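Motivate the need, in episodic RL, for performance guarantees that bound the number of mistakes at every accuracy level ε simultaneously.
Define Uniform-PAC as a time-uniform strengthening of PAC that implies high-probability regret bounds.
Develop an algorithm that is Uniform-PAC while also providing near-optimal PAC and regret guarantees.
Provide a theoretical analysis showing that Uniform-PAC implies convergence to an optimal policy with high probability.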
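
For reference, the Uniform-PAC notion can be written out as below. This is a sketch of the definition as stated in the paper, not taken from the summary itself: the symbols Δ_k (the optimality gap of the policy played in episode k), N_ε (the number of episodes whose gap exceeds ε) and F_UPAC (a bound polynomial in its arguments and in S, A, H) are notation assumed here.

```latex
N_\varepsilon = \sum_{k=1}^{\infty} \mathbb{1}\{\Delta_k > \varepsilon\},
\qquad
\mathbb{P}\Big(\exists\, \varepsilon > 0 :\;
  N_\varepsilon > F_{\mathrm{UPAC}}\big(1/\varepsilon,\ \log(1/\delta)\big)\Big) \le \delta .
```

A classical (ε, δ)-PAC guarantee, by contrast, fixes a single ε in advance and only requires the failure probability to be at most δ for that ε. The Uniform-PAC event covers all ε at once, which is what allows it to be turned into a regret statement.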

Proposed Method

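Introduce Uniform-PAC as a framework and relate it to PAC and regret guarantees.
Propose UBEV, an optimistic RL algorithm that uses time-uniform, law-of-the-iterated-logarithm (LIL) confidence intervals.
Model the episodic, fixed-horizon MDP with time-dependent dynamics, and apply backward induction with confidence intervals on transitions and rewards.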
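Use the LIL-based confidence width $\phi(s,a,t) = \sqrt{\frac{2 \ln\ln\max\{e,\, n(s,a,t)\} + \ln(18SAH/\delta)}{n(s,a,t)}}$, where $n(s,a,t)$ is the number of times action a has been taken in state s at time step t.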
Prove that UBEV achieves Uniform-PAC bounds and near-optimal regret, with the sample-complexity and regret dependencies as stated in Theorem 4.
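
The confidence width above is simple to evaluate numerically. The following is a minimal sketch, assuming only the formula as given in this summary; the function name `lil_confidence_width` and its parameter names are hypothetical and do not come from the paper or any library.

```python
import math


def lil_confidence_width(n: int, S: int, A: int, H: int, delta: float) -> float:
    """Width phi = sqrt((2 ln ln max(e, n) + ln(18 S A H / delta)) / n).

    n is the visit count n(s, a, t); S, A, H are the numbers of states,
    actions, and time steps per episode; delta is the failure probability.
    """
    if n < 1:
        raise ValueError("n must be at least 1 (the width is undefined for unvisited pairs)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # Iterated-logarithm term: grows like ln ln n, and is 0 while n <= e.
    lil_term = 2.0 * math.log(math.log(max(math.e, n)))
    # Union-bound term over all (s, a, t) triples.
    union_term = math.log(18.0 * S * A * H / delta)
    return math.sqrt((lil_term + union_term) / n)


if __name__ == "__main__":
    # Illustrative (made-up) problem sizes.
    for n in (1, 10, 100, 1000):
        print(n, round(lil_confidence_width(n, S=5, A=2, H=10, delta=0.1), 4))
```

The numerator grows only doubly logarithmically in n, so the width shrinks at essentially the 1/sqrt(n) rate while remaining valid for all visit counts at once, rather than for a single sample size fixed in advance.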

Results

Research Questions

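RQ1: Can Uniform-PAC provide high-probability guarantees for all ε levels simultaneously in episodic RL?
RQ2: Can a single algorithm be Uniform-PAC while also achieving near-optimal PAC and regret guarantees?
RQ3: Which confidence-interval construction yields uniform, anytime guarantees in RL?
RQ4: How do Uniform-PAC guarantees relate to existing PAC and regret notions for episodic MDPs?
RQ5: What are the theoretical limits on converting PAC or regret guarantees into Uniform-PAC guarantees?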

Key Findings

UBEV is Uniform-PAC: the number of episodes in which its policy is not ε-optimal is bounded by O(SAH^4/ε^2), up to polylogarithmic factors.
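With probability at least 1−δ, UBEV guarantees regret R(T) = O(H^2(√(SAT) + S^3A^2) polylog(S,A,H,T)).
A Uniform-PAC guarantee implies convergence to an optimal policy with high probability and yields a uniform high-probability regret bound.
Uniform-PAC is strictly stronger than both PAC and high-probability regret guarantees, and implies both where applicable.
UBEV uses time-uniform LIL confidence bounds that shrink at rate sqrt((log log n)/n), which yields guarantees that hold uniformly over all episodes.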
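These bounds improve on earlier MBIE-style results by reducing the horizon dependence, and are near-optimal in their dependence on S and A; as the abstract notes, a gap of a factor of the horizon H remains.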
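
One way to see why a Uniform-PAC bound yields a regret bound is the following back-of-the-envelope argument, offered as a sketch rather than the paper's proof. Assume, dropping polylog factors and lower-order terms, that the number of episodes with optimality gap above ε is at most C/ε^2 for every ε at once. Then the k-th largest gap is at most sqrt(C/k), and summing over T episodes gives regret at most 2·sqrt(C·T). With C = SAH^4 this is 2·H^2·√(SAT), which matches the leading term of the regret bound above. The function name `regret_from_mistake_bound` and the cap `max_gap` (gaps cannot exceed H when per-step rewards lie in [0, 1]) are assumptions made for this illustration.

```python
import math


def regret_from_mistake_bound(C: float, max_gap: float, T: int) -> float:
    """Largest total regret over T episodes consistent with the assumed
    mistake bound N_eps <= C / eps^2 holding for all eps simultaneously.

    The k-th largest optimality gap can be at most sqrt(C / k), and no gap
    can exceed max_gap, so the worst case sums min(max_gap, sqrt(C / k)).
    """
    return sum(min(max_gap, math.sqrt(C / k)) for k in range(1, T + 1))


if __name__ == "__main__":
    # Illustrative (made-up) sizes: S = 5, A = 2, H = 10, so C = S * A * H^4.
    S, A, H = 5, 2, 10
    C = S * A * H**4
    for T in (1_000, 10_000, 100_000):
        worst = regret_from_mistake_bound(C, max_gap=H, T=T)
        # Compare against the closed-form envelope 2 * sqrt(C * T).
        print(T, round(worst, 1), round(2 * math.sqrt(C * T), 1))
```

The per-episode average of this worst case shrinks as T grows, which is the sublinear-regret behaviour the findings describe. A plain (ε, δ)-PAC bound for one fixed ε does not support this argument, since it says nothing about gaps smaller than ε.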

This review was generated by AI and reviewed by human editors.