QUICK REVIEW

[Paper Review] Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

Mohammad Sadegh Talebi, Odalric-Ambrym Maillard|arXiv (Cornell University)|Mar 5, 2018

Reinforcement Learning in Robotics | Cited by 19

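One-Sentence Summary

This paper studies undiscounted reinforcement learning under the average-reward criterion and derives a variance-aware regret bound for the KL-UCRL algorithm, replacing the conventional diameter-based term with variance-dependent ones. The key result is a high-probability regret bound of $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, which improves on earlier bounds by exploiting the local variance of the bias function rather than the MDP diameter and the number of actions.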

ABSTRACT

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.

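Motivation and Objectives

Improve regret bounds for undiscounted reinforcement learning under the average-reward criterion by replacing the MDP diameter with the local variance of the bias function.
Provide a tighter high-probability regret bound for the KL-UCRL algorithm in ergodic MDPs through a variance-aware analysis.
Establish a new regret bound that scales with the sum of the state-action variances rather than with the number of actions or the diameter.
Introduce new variants of the transportation lemma and apply KL concentration inequalities to sharpen the analysis in MDPs.
Show that the new bound can improve on the diameter-based bound by an order of magnitude in certain benchmark MDPs.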
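
To make the relation between the two quantities concrete, the verbal definition of the local variance can be written out and compared with the diameter. This is only a sketch under stated assumptions: rewards lie in $[0,1]$, and the span of the optimal bias function is at most $D$ (a standard fact for communicating MDPs). The symbols $p(\cdot \mid s,a)$ for the transition kernel and $b^\star$ for the bias function are notation chosen here and may differ from the paper's.

$$\mathbf{V}^\star_{s,a} = \sum_{x} p(x \mid s,a)\Bigl(b^\star(x) - \sum_{y} p(y \mid s,a)\, b^\star(y)\Bigr)^2 \le \frac{\bigl(\max_x b^\star(x) - \min_x b^\star(x)\bigr)^2}{4} \le \frac{D^2}{4}$$

$$\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T} \le \sqrt{S \cdot SA \cdot \frac{D^2}{4} \cdot T} = \frac{1}{2}\, DS\sqrt{AT}$$

Under these assumptions the variance-aware leading term is never larger than the diameter-based one, and it is much smaller whenever the next-state distributions concentrate on states with similar bias values.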

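Proposed Method

Revisit the minimax lower bound for undiscounted RL, with the local variance of the bias function in place of the MDP diameter.
Give a new analysis of the KL-UCRL algorithm based on variance-aware concentration inequalities and variants of the transportation lemma.
Derive a high-probability regret bound scaling as $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution after taking action $a$ in state $s$.
Use the Bellman optimality equation and a decomposition of the bias function to relate the regret to suboptimality gaps and state visit counts.
Apply the Azuma-Hoeffding inequality and Kullback-Leibler concentration inequalities to control deviations in the value estimates.
Derive a regret decomposition that separates the bias term from the suboptimality gaps, which enables variance-based control.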
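
The KL concentration step mentioned above can be illustrated with a small sketch of a KL-based confidence set for one state-action pair: a candidate next-state distribution is kept as plausible when its KL divergence from the empirical estimate, scaled by the visit count, stays below a threshold. The function names and the `threshold` value are hypothetical placeholders. The actual threshold used by KL-UCRL depends on quantities such as $S$, $A$, $T$ and the confidence level, and is not reproduced here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf     # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

def in_kl_confidence_set(p_hat, q, n_visits, threshold):
    """Check n_visits * KL(p_hat || q) <= threshold (threshold is an assumption)."""
    if n_visits == 0:
        return True             # no data yet: every distribution is plausible
    return n_visits * kl_divergence(p_hat, q) <= threshold

# Illustrative inputs only: an empirical estimate after 100 visits.
p_hat = [0.5, 0.5]
near = in_kl_confidence_set(p_hat, [0.6, 0.4], n_visits=100, threshold=5.0)
far = in_kl_confidence_set(p_hat, [0.9, 0.1], n_visits=100, threshold=5.0)
print(near, far)  # True False
```

The set shrinks as the visit count grows, so distributions far from the empirical estimate are excluded once a pair has been sampled often enough.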

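Experimental Results

Research Questions

RQ1: Can regret bounds for undiscounted RL be improved by replacing the MDP diameter with the local variance of the bias function?
RQ2: Does the KL-UCRL algorithm obtain a tighter regret bound under a variance-aware analysis?
RQ3: How does the new variance-based regret bound compare with the classical $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ bound in terms of scaling and practical performance?
RQ4: Do the new variants of the transportation lemma and the KL concentration inequalities yield tighter bounds in low-variance MDPs?
RQ5: In which MDP structures does the variance-aware bound offer a significant improvement over the diameter-based bound?

Key Findings

The proposed regret bound is $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, which replaces the diameter $D$ and the number of actions $A$ with state-action variances.
This bound improves on the previously known $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ bound, removing the explicit dependence on $D$ and $A$.
In benchmark MDPs, the new bound can improve on the classical bound by an order of magnitude because the variance terms are smaller.
The analysis introduces new variants of the transportation lemma and KL concentration techniques, which may be of independent interest beyond this work.
The effective regret is bounded by $D + \sum_{s,a} \mathbb{E}[N_T(s,a)] \varphi(s,a)$, where $\varphi(s,a)$ is the suboptimality gap, and the bias term is controlled through variance-aware bounds.
The bound holds with high probability and is derived under the assumptions of an ergodic MDP with finite state and action spaces.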
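
The comparison of leading terms can be sketched numerically. The example below evaluates $\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}}$ and $DS\sqrt{A}$ (dropping $\sqrt{T}$ and logarithmic factors) on a hypothetical two-state, two-action MDP that is not one of the paper's benchmarks. The bias vector and the diameter are assumptions of this illustration: they were worked out by hand for this toy, with reward 0 in state 0 and reward 1 in state 1.

```python
import math

def local_variance(next_state_probs, bias):
    """Variance of the bias function under one next-state distribution p(.|s,a)."""
    mean = sum(p * b for p, b in zip(next_state_probs, bias))
    return sum(p * (b - mean) ** 2 for p, b in zip(next_state_probs, bias))

def leading_terms(transitions, bias, diameter):
    """Return (sqrt(S * sum of V), D * S * sqrt(A)), without sqrt(T) and log factors.

    transitions[s][a] is the next-state distribution for action a in state s.
    """
    num_states = len(transitions)
    max_actions = max(len(actions) for actions in transitions)
    total_variance = sum(
        local_variance(probs, bias) for actions in transitions for probs in actions
    )
    variance_term = math.sqrt(num_states * total_variance)
    diameter_term = diameter * num_states * math.sqrt(max_actions)
    return variance_term, diameter_term

# Hypothetical toy MDP (illustrative numbers, not taken from the paper).
toy_transitions = [
    [[0.9, 0.1], [0.5, 0.5]],  # state 0: actions 0 and 1
    [[0.1, 0.9], [0.5, 0.5]],  # state 1: actions 0 and 1
]
toy_bias = [0.0, 5.0 / 3.0]    # optimal bias up to an additive constant (hand-derived)
toy_diameter = 2.0             # best expected hitting time between the two states

variance_term, diameter_term = leading_terms(toy_transitions, toy_bias, toy_diameter)
print(variance_term, diameter_term)  # about 1.94 versus about 5.66
```

Even in this tiny example the variance-based term is roughly a third of the diameter-based one. The gap reported for the paper's benchmark MDPs comes from the same mechanism, namely local variances that are small relative to $D^2$.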


This review was generated by AI and reviewed by human editors.