QUICK REVIEW

[Paper Review] Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model

Aaron Sidford, Mengdi Wang|arXiv (Cornell University)|Jun 5, 2018

Machine Learning and Algorithms | References: 21 | Cited by: 30

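One-Sentence Summary

The paper proposes a variance-reduced Q-value iteration algorithm that, given a generative model, computes an $\epsilon$-optimal policy for a discounted Markov decision process (DMDP) with near-optimal time and sample complexity. Its sample complexity matches the lower bound up to logarithmic factors, and its running time equals its sample complexity up to constant factors, so the method is optimal in both samples and running time, up to logarithmic factors, for $1/\sqrt{(1-\gamma)|\mathcal{S}|} \leq \epsilon \leq 1$.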

ABSTRACT

In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Markov Decision Process (DMDP) provided we can only access its transition function through a generative sampling model that given any state-action pair samples from the transition function in $O(1)$ time. Given such a DMDP with states $S$, actions $A$, discount factor $\gamma\in(0,1)$, and rewards in range $[0, 1]$ we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ where \emph{both} the time spent and number of samples taken are upper bounded by \[ O\left[\frac{|S||A|}{(1-\gamma)^3 \epsilon^2} \log \left(\frac{|S||A|}{(1-\gamma)\delta\epsilon}\right) \log\left(\frac{1}{(1-\gamma)\epsilon}\right) \right] ~. \] For fixed values of $\epsilon$, this improves upon the previous best known bounds by a factor of $(1 - \gamma)^{-1}$ and matches the sample complexity lower bounds proved in Azar et al. (2013) up to logarithmic factors. We also extend our method to computing $\epsilon$-optimal policies for finite-horizon MDP with a generative model and provide a nearly matching sample complexity lower bound.
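To make the access model concrete, here is a minimal sketch of a generative sampling oracle for a small tabular DMDP. The class name `GenerativeModel`, its `sample` method, and the toy transition numbers are assumptions for illustration and do not come from the paper. Note also that `Generator.choice` costs time proportional to the number of states per call; a true $O(1)$ sampler would need something like precomputed alias tables.

```python
import numpy as np


class GenerativeModel:
    """Hypothetical sampling oracle: the solver may only call sample(s, a)."""

    def __init__(self, P, seed=0):
        # P[s, a, s_next] holds the transition probabilities. A solver is
        # expected not to read this array directly, only to draw samples.
        self._P = np.asarray(P, dtype=float)
        self._rng = np.random.default_rng(seed)
        self.num_states, self.num_actions, _ = self._P.shape
        self.samples_drawn = 0  # running count of oracle calls

    def sample(self, s, a):
        """Draw one next state from P(. | s, a)."""
        self.samples_drawn += 1
        return int(self._rng.choice(self.num_states, p=self._P[s, a]))


# Toy two-state, two-action DMDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
model = GenerativeModel(P)
draws = [model.sample(0, 0) for _ in range(1000)]
print("empirical P(s'=0 | s=0, a=0):", draws.count(0) / len(draws))
print("samples drawn:", model.samples_drawn)
```

Counting calls in `samples_drawn` mirrors how sample complexity is measured in this setting: one unit per call to the oracle.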

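Research Motivation and Objectives

Address the gap left by existing generative-model algorithms, which do not achieve optimal sample complexity and optimal running time simultaneously when computing an $\epsilon$-optimal policy.
Close the theoretical gap between the best known upper bound and the established sample complexity lower bound for this problem.
Design an algorithm whose sample and running time complexities are both optimal, up to polylogarithmic factors, in the regime of interest.
Extend the method to finite-horizon MDPs and provide a nearly matching sample complexity lower bound.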
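Improve on prior methods, which need on the order of $(1-\gamma)^{-4}\epsilon^{-2}$ samples per state-action pair (up to logarithmic factors) to reach $\epsilon$-optimality, by removing one factor of $(1-\gamma)^{-1}$ from that dependence.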

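Proposed Method

The paper proposes a randomized variance-reduced Q-value iteration (vQVI) algorithm, which uses variance reduction to improve the convergence and stability of value iteration.
Variance reduction is applied in the Q-value update step to lower the noise in the sampled estimates of the expected next-state value, so the iteration converges with fewer samples.
Each state-action pair is sampled through the generative model, which returns a draw from the transition distribution in $O(1)$ time.
The algorithm uses sparse updates to keep the running time low, so that total time stays proportional to the number of samples used.
Key theoretical ingredients include concentration inequalities and martingale arguments, which bound, with high probability, the deviation between estimated and true values.
The method is extended to finite-horizon MDPs by using a discount-factor transformation to relate the finite-horizon and infinite-horizon problems.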
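The variance-reduction idea can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: it omits whatever confidence corrections and refined variance analysis the paper relies on, and the function names, epoch structure, and sample sizes (`m_ref`, `m_inner`) are assumptions chosen for a toy problem. Each epoch spends many samples once to estimate $P v_{\text{ref}}$ for a reference value vector, then uses only a few samples per inner step to estimate the small correction $P(v - v_{\text{ref}})$.

```python
import numpy as np


def variance_reduced_qvi(P, r, gamma, epochs=5, inner=20,
                         m_ref=5000, m_inner=50, seed=0):
    """Simplified variance-reduced Q-value iteration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    S, A = r.shape

    def mean_next_value(vec, m):
        # Monte Carlo estimate of (P vec)[s, a] from m samples per pair.
        # P is used only to draw samples, standing in for the oracle.
        est = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                nxt = rng.choice(S, size=m, p=P[s, a])
                est[s, a] = vec[nxt].mean()
        return est

    v = np.zeros(S)
    q = np.zeros((S, A))
    for _ in range(epochs):
        v_ref = v.copy()
        x_ref = mean_next_value(v_ref, m_ref)  # costly, once per epoch
        for _ in range(inner):
            # v - v_ref is small, so a few samples estimate it well.
            g = mean_next_value(v - v_ref, m_inner)
            q = r + gamma * (x_ref + g)
            v = q.max(axis=1)
    return q, q.argmax(axis=1)


def exact_q(P, r, gamma, iters=2000):
    """Exact Q-value iteration with known P, used only as a reference."""
    q = np.zeros_like(r)
    for _ in range(iters):
        q = r + gamma * (P @ q.max(axis=1))
    return q


# Toy DMDP with rewards in [0, 1] (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 0.3], [1.0, 0.6]])
q_hat, policy = variance_reduced_qvi(P, r, gamma=0.9)
q_star = exact_q(P, r, gamma=0.9)
print("max |Q_hat - Q*| =", np.abs(q_hat - q_star).max())
```

In this sketch each inner step draws 50 samples per state-action pair instead of 5000, which is where the sample savings come from; the estimate stays accurate only because the reference term was estimated carefully at the start of the epoch.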

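Experimental Results

Research Questions

RQ1: What is the optimal sample complexity of computing an $\epsilon$-optimal policy for a discounted MDP with a generative model?
RQ2: Can an algorithm achieve optimal sample complexity and optimal running time complexity at the same time?
RQ3: How does the dependence on $(1-\gamma)^{-1}$ affect the sample and running time complexities of existing algorithms?
RQ4: What is the tightest lower bound on the number of samples needed to compute an $\epsilon$-optimal policy?
RQ5: Can the proposed algorithm be extended to finite-horizon MDPs and match the sample complexity lower bound there?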

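Key Findings

The proposed vQVI algorithm computes an $\epsilon$-optimal policy with probability $1-\delta$ using $O\left[\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\epsilon^2}\log\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\delta\epsilon}\right)\log\left(\frac{1}{(1-\gamma)\epsilon}\right)\right]$ samples.
Under the assumption that each sampled transition takes $O(1)$ time, the algorithm's running time equals its sample complexity up to constant factors.
The algorithm's sample complexity matches the known lower bound of [AMK13] up to logarithmic factors.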
For finite-horizon MDPs with horizon $H$, the paper gives a sample complexity lower bound of $\Omega(H^{3}\epsilon^{-2}|\mathcal{S}||\mathcal{A}|/\log\epsilon^{-1})$, which the extended method nearly matches.
The algorithm improves on prior work by a factor of $(1-\gamma)^{-1}$ in its dependence on the discount factor, closing a long-standing gap in the literature.
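To get a feel for the size of these bounds, the sketch below evaluates the stated upper bound with the hidden constant set to 1, which is an assumption, so the output is only an order-of-magnitude indication. It also compares leading terms with exponents 3 and 4 on $(1-\gamma)^{-1}$; the exponent 4 for earlier work is inferred from the abstract's statement that the new bound improves on the previous best by one factor of $(1-\gamma)^{-1}$. The function names and problem sizes are hypothetical.

```python
import math


def sample_bound(S, A, gamma, eps, delta):
    """Stated upper bound with the big-O constant taken to be 1."""
    h = 1.0 / (1.0 - gamma)  # effective horizon 1 / (1 - gamma)
    main = S * A * h**3 / eps**2
    log_a = math.log(S * A * h / (delta * eps))
    log_b = math.log(h / eps)
    return main * log_a * log_b


def leading_term(S, A, gamma, eps, power):
    """|S||A| / ((1 - gamma)^power * eps^2), ignoring log factors."""
    return S * A / ((1.0 - gamma) ** power * eps**2)


# Hypothetical problem size: 100 states, 10 actions.
S, A, gamma, eps, delta = 100, 10, 0.99, 0.1, 0.05
new = leading_term(S, A, gamma, eps, power=3)
old = leading_term(S, A, gamma, eps, power=4)
print(f"leading term, exponent 3: {new:.3e}")
print(f"leading term, exponent 4: {old:.3e}")
print(f"ratio old / new: {old / new:.1f}")  # equals 1 / (1 - gamma)
print(f"bound with log factors: {sample_bound(S, A, gamma, eps, delta):.3e}")
```

The ratio of the two leading terms is exactly $1/(1-\gamma)$, so the saving grows as the discount factor approaches 1.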

This review was generated by AI and reviewed by human editors.