QUICK REVIEW

[Paper Review] Best Policy Identification in discounted MDPs: Problem-specific Sample Complexity

Aymen Al Marjani, Alexandre Proutière|arXiv (Cornell University)|Jan 1, 2020

Reinforcement Learning in Robotics|Cited by 3

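One-Sentence Summary

This paper proposes KLB-TS, a new algorithm for best-policy identification in discounted MDPs with a generative model, built on a problem-specific sample complexity lower bound whose optimal allocation solves a non-convex program. The algorithm tracks a nearly-optimal sample allocation, obtained from an explicit upper bound on that lower bound and depending on MDP-specific functionals such as the sub-optimality gaps and the variance of the next-state value function, and it comes with asymptotic sample complexity guarantees.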

ABSTRACT

We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) with finite state and action spaces. We assume that the agent has access to a generative model and that the MDP possesses a unique optimal policy. In this setting, we derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimal gaps and the variance of the next-state value function, and thus really summarizes the hardness of the MDP. We devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are finally discussed.

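Research Motivation and Objectives

Establish a problem-specific lower bound on the sample complexity of best-policy identification in discounted MDPs with a unique optimal policy.
Derive a tight upper bound on this lower bound, which yields an explicit, nearly-optimal sample allocation.
Design an algorithm, KLB-TS, that tracks this nearly-optimal allocation online.
Provide asymptotic guarantees for the sample complexity of KLB-TS (both almost surely and in expectation).
Discuss the sample-efficiency advantages of KLB-TS over state-of-the-art algorithms.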

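Proposed Method

Derives a problem-specific lower bound on the sample complexity; the corresponding optimal allocation over state-action pairs solves a non-convex program.
Introduces a tight upper bound on this lower bound that depends explicitly on MDP-specific functionals: the sub-optimality gaps and the variance of the next-state value function.
Proposes the KLB-TS (KL Ball Track-and-Stop) algorithm, which tracks the nearly-optimal sample allocation induced by the upper bound.
Uses a KL-divergence-based track-and-stop strategy to decide which state-action pairs to query from the generative model and when to stop.
Uses a stopping rule built from confidence regions derived from the upper bound, which supports the asymptotic guarantees.
Provides theoretical guarantees on the sample complexity of KLB-TS that hold asymptotically, both almost surely and in expectation.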
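The two functionals named above can be computed directly when the MDP is known. The following is a minimal sketch, not the paper's implementation: it assumes a tabular MDP given as a transition array `P` of shape (S, A, S) and a reward array `R` of shape (S, A) (both names are illustrative), solves it with plain value iteration, and then evaluates the sub-optimality gaps V*(s) - Q*(s, a) and the variance of V*(s') under s' ~ p(.|s, a). In the learning setting these quantities would have to be estimated from samples.

```python
import numpy as np

def mdp_functionals(P, R, gamma, tol=1e-10, max_iter=100_000):
    """Return (V*, greedy policy, gaps, next-state value variance).

    P: (S, A, S) array of transition probabilities (hypothetical input layout).
    R: (S, A) array of expected rewards.
    gamma: discount factor in [0, 1).
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    # Plain value iteration: V <- max_a [R + gamma * P V]
    for _ in range(max_iter):
        V_new = (R + gamma * P @ V).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * P @ V                # (S, A) optimal action values
    V_star = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    gaps = V_star[:, None] - Q           # zero exactly at the greedy action
    mean_next = P @ V_star               # E[V*(s')] for each (s, a)
    # Var[V*(s')] = E[V*(s')^2] - E[V*(s')]^2, clipped against round-off
    var_next = np.maximum(P @ V_star**2 - mean_next**2, 0.0)
    return V_star, policy, gaps, var_next
```

Pairs with a small gap or a large next-state value variance are the ones that are harder to rule out, which is why the summary describes these functionals as capturing the hardness of the MDP.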

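Results

Research Questions

RQ1: What is the fundamental problem-specific lower bound on the sample complexity of identifying the optimal policy in a discounted MDP with a generative model?
RQ2: How can this lower bound be tightly approximated to obtain a practical, nearly-optimal sample allocation?
RQ3: Can an algorithm be designed that tracks this nearly-optimal allocation online while retaining asymptotic sample complexity guarantees?
RQ4: What theoretical guarantees hold for the sample complexity of such an algorithm, almost surely and in expectation?
RQ5: How does the proposed algorithm compare with existing state-of-the-art methods in terms of sample efficiency?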

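Key Findings

The paper establishes a problem-specific lower bound on the sample complexity that holds for any learning algorithm.
It derives a tight upper bound on this lower bound, expressed through MDP-specific functionals such as the sub-optimality gaps and the variance of the next-state value function, so that an explicit, nearly-optimal sample allocation is available without solving the original non-convex program.
KLB-TS, which tracks this allocation, is shown to enjoy asymptotic sample complexity guarantees, both almost surely and in expectation.
The algorithm relies on a KL-divergence-based track-and-stop mechanism that governs both sampling and stopping.
The paper discusses the advantages of KLB-TS over state-of-the-art algorithms in terms of sample efficiency.
The theoretical framework offers a systematic way to quantify the difficulty of best-policy identification from the intrinsic structure of the MDP.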
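To make the idea of "tracking" an allocation concrete, here is a small sketch of a direct-tracking sampling rule in the style used in the track-and-stop literature for bandits. It is an illustration under stated assumptions, not the rule used by KLB-TS: the function name `next_pair_to_sample`, the forced-exploration threshold, and the fixed `target` allocation are all choices made for this example (in an actual algorithm the target would be re-estimated as data arrives).

```python
import numpy as np

def next_pair_to_sample(counts, target):
    """Pick the next (state, action) pair to query from the generative model.

    counts: (S, A) array, number of samples drawn so far from each pair.
    target: (S, A) array, target allocation (non-negative, sums to 1).
    Both argument names are hypothetical.
    """
    t = counts.sum()
    n_pairs = counts.size
    # Forced exploration: pairs lagging far behind sqrt(t) are sampled first,
    # so that every pair is eventually sampled infinitely often.
    lagging = counts < np.sqrt(t) - n_pairs / 2
    if lagging.any():
        flat = np.argmin(np.where(lagging, counts, np.inf))
    else:
        # Otherwise sample the pair furthest below its target share t * target.
        flat = np.argmax(t * target - counts)
    return np.unravel_index(flat, counts.shape)

# Usage: with a fixed target, the empirical proportions drift toward it.
target = np.array([[0.5, 0.1], [0.3, 0.1]])
counts = np.zeros_like(target)
for _ in range(2000):
    counts[next_pair_to_sample(counts, target)] += 1
proportions = counts / counts.sum()
```

The forced-exploration branch trades a small number of off-target samples for the guarantee that the estimates behind the target allocation keep improving for every pair.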


This review was generated by AI and reviewed by human editors.