QUICK REVIEW

[Paper Review] Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal

Alekh Agarwal, Sham M. Kakade|arXiv (Cornell University)|Jun 10, 2019

Reinforcement Learning in Robotics | References: 22 | Cited by: 33

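One-Sentence Summary

This paper proves that the naive plug-in, model-based planning approach with a generative model is non-asymptotically minimax optimal for obtaining an ε-optimal policy, and it analyzes the sample and computational complexity of that approach.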

ABSTRACT

This work considers the sample and computational complexity of obtaining an $ε$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this work, we study the effectiveness of the most natural plug-in approach to model-based planning: we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP. We ask arguably the most basic and unresolved question in model based planning: is the naive "plug-in" approach, non-asymptotically, minimax optimal in the quality of the policy it finds, given a fixed sample size? Here, the non-asymptotic regime refers to when the sample size is sublinear in the model size. With access to a generative model, we resolve this question in the strongest possible sense: our main result shows that \emph{any} high accuracy solution in the plug-in model constructed with $N$ samples, provides an $ε$-optimal policy in the true underlying MDP (where $ε$ is the minimax accuracy with $N$ samples at every state, action pair). In comparison, all prior (non-asymptotically) minimax optimal results use model free approaches, such as the Variance Reduced Q-value iteration algorithm (Sidford et al 2018), while the best known model-based results (e.g. Azar et al 2013) require larger sample sizes in their dependence on the planning horizon or the state space. Notably, we show that the model-based approach allows the use of \emph{any} efficient planning algorithm in the empirical MDP, which simplifies algorithm design as this approach does not tie the algorithm to the sampling procedure. The core of our analysis is a novel "absorbing MDP" construction to address the statistical dependency issues that arise in the analysis of model-based planning approaches, a construction which may be helpful more generally.

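Motivation and Objectives

Assess whether the naive plug-in approach (build an empirical MDP from N samples by maximum likelihood, then plan in it) is minimax optimal for finding an ε-optimal policy in a finite discounted MDP with a generative model.
Derive explicit non-asymptotic sample complexity bounds and compare them with prior model-based and model-free results.
Show that, in the regime where the sample size is sublinear in the model size, any ε-optimal plan in the empirical MDP yields an ε-optimal policy in the true MDP.
Develop an absorbing MDP construction that handles the statistical dependencies in the analysis, and point out its potential for wider use.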

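Proposed Method

Work in the generative-model setting, where a sampling oracle returns a next state for any queried state-action pair.
Build the empirical MDP from the maximum likelihood estimate of the transition kernel, using N samples at each state-action pair.
Run any optimization oracle (for example, value iteration or policy iteration) to obtain an ε_opt-optimal policy in the empirical MDP.
Show that if N samples are drawn at every pair (s,a), with $N \ge c \log(\cdots) / ((1-γ)^{3} ε^{2})$, then with high probability the resulting policy $\hat{π}$ satisfies, in the true MDP M, $Q^{\hat{π}} \ge Q^{*} - ε - 9 ε_{opt}/(1-γ)$ and $V^{\hat{π}} \ge V^{*} - ε - 9 ε_{opt}/(1-γ)$.
Use the absorbing MDP construction to decouple the dependence between the estimated transition model and the value function, and to bound the estimation error.
Discuss the computational implications under classical planners (value iteration, policy iteration), and the sample complexity that is nearly linear in |S||A|.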
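The steps above can be illustrated with a minimal sketch. It assumes a small random tabular MDP with known rewards, and all names (`build_empirical_mdp`, `value_iteration`, `policy_value`, `sample_next_state`) are hypothetical helpers introduced here, not code from the paper. Value iteration stands in for the "any planner" step; any other accurate planner could be substituted.

```python
import numpy as np

def build_empirical_mdp(sample_next_state, n_states, n_actions, n_samples):
    """MLE of the transition kernel: empirical next-state frequencies per (s, a)."""
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, sample_next_state(s, a)] += 1.0
    return p_hat / n_samples

def value_iteration(p, r, gamma, tol=1e-8, max_iter=100_000):
    """Q-value iteration on a tabular MDP; returns Q and its greedy policy."""
    q = np.zeros(r.shape)
    for _ in range(max_iter):
        q_new = r + gamma * (p @ q.max(axis=1))
        done = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if done:
            break
    return q, q.argmax(axis=1)

def policy_value(p, r, gamma, policy):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) V = r_pi."""
    idx = np.arange(r.shape[0])
    p_pi, r_pi = p[idx, policy], r[idx, policy]
    return np.linalg.solve(np.eye(r.shape[0]) - gamma * p_pi, r_pi)

# Toy instance (an assumption for illustration): random transitions, rewards in [0, 1].
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9
p_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def sample_next_state(s, a):
    """Generative model: one draw of the next state from the true kernel."""
    return rng.choice(n_states, p=p_true[s, a])

p_hat = build_empirical_mdp(sample_next_state, n_states, n_actions, n_samples=500)
_, pi_hat = value_iteration(p_hat, r, gamma)    # plan in the empirical MDP
_, pi_star = value_iteration(p_true, r, gamma)  # reference: plan in the true MDP

# Suboptimality of the plug-in policy, evaluated in the true MDP.
gap = np.max(policy_value(p_true, r, gamma, pi_star)
             - policy_value(p_true, r, gamma, pi_hat))
print(f"max suboptimality of plug-in policy: {gap:.4f}")
```

The sampling step and the planning step are separate functions, which mirrors the point that the planner is not tied to the sampling procedure.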

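Results

Research Questions

RQ1: Does plug-in, model-based planning achieve minimax-optimal policy quality in the non-asymptotic regime?
RQ2: How many samples per state-action pair are needed to obtain an ε-optimal policy with high probability?
RQ3: How does planning error in the empirical MDP translate into policy performance in the true MDP, and can the statistical dependencies be controlled without the blow-up that a uniform convergence argument would incur?
RQ4: Does the absorbing MDP construction simplify the analysis, and does it extend to broader planning settings?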

Key Findings

Algorithm	Sample complexity	ε-range	Reference
Phased Q-Learning	C |S||A| / ((1-γ)^{7} ε^{2})	(0, (1-γ)^{-1}]	Kearns and Singh (1999)
Empirical QVI	|S||A| / ((1-γ)^{5} ε^{2})	(0, 1]	Azar et al. (2013)
Empirical QVI	|S||A| / ((1-γ)^{3} ε^{2})	(0, 1/sqrt((1-γ)|S|)]	Azar et al. (2013)
Randomized Primal-Dual Method	C |S||A| / ((1-γ)^{4} ε^{2})	(0, (1-γ)^{-1}]	Wang (2017)
Sublinear Randomized Value Iteration	|S||A| / ((1-γ)^{4} ε^{2}) * poly log ε^{-1}	(0, 1]	Sidford et al. (2018b)
Variance Reduced QVI	|S||A| / ((1-γ)^{3} ε^{2}) * poly log ε^{-1}	(0, 1]	Sidford et al. (2018a)
Empirical MDP + any accurate black-box planner	|S||A| / ((1-γ)^{3} ε^{2})	(0, (1-γ)^{-1/2}]	This work
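To get a feel for how much the exponent on (1-γ) matters in the table above, the following sketch evaluates only the leading term |S||A| / ((1-γ)^p ε^2) for several rows. Constants and logarithmic factors are dropped, the problem sizes are arbitrary illustrative choices, and the function name `leading_sample_term` is hypothetical.

```python
def leading_sample_term(n_states, n_actions, gamma, eps, horizon_power):
    """Leading term |S||A| / ((1 - gamma)^p * eps^2); constants and logs omitted."""
    return n_states * n_actions / ((1.0 - gamma) ** horizon_power * eps ** 2)

# Exponent p on (1 - gamma) for selected rows of the table.
horizon_powers = {
    "Phased Q-Learning": 7,
    "Empirical QVI, eps in (0, 1]": 5,
    "Randomized Primal-Dual Method": 4,
    "Empirical MDP + any planner": 3,
}

# Illustrative problem size (an assumption, not taken from the paper).
n_states, n_actions, gamma, eps = 100, 10, 0.9, 0.5

for name, power in horizon_powers.items():
    term = leading_sample_term(n_states, n_actions, gamma, eps, power)
    print(f"{name:32s} {term:.3e}")
```

With γ = 0.9 the effective horizon 1/(1-γ) is 10, so each extra power of (1-γ) in the denominator multiplies the leading term by 10.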

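An ε-optimal policy for the true MDP can be obtained from an ε-optimal plan in the empirical MDP once N, the number of samples per state-action pair, exceeds a threshold determined by a constant c, γ, |S||A|, (1-γ)^{-1}, δ, and ε^2.
The total sample complexity is O(|S||A| log(|S||A|/((1-γ)δ)) / ((1-γ)^{3} ε^{2})).
The result holds for any planning algorithm that finds a near-optimal policy in the empirical MDP, which leaves the choice of planner open.
This model-based result matches the minimax-optimal rate known for model-free methods on ε ∈ (0,1], showing that model-based planning can also be minimax optimal in the non-asymptotic sense.
The absorbing MDP construction mitigates the statistical dependency issues in the analysis and may be useful beyond this setting.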
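The absorbing MDP idea can be sketched as follows. This is one reading of the construction, stated as an assumption: for a state s and a scalar u, the modified MDP is identical to the original except that s is made absorbing and pays reward (1-γ)u under every action, so the value at s is exactly u. The helper names `absorbing_mdp` and `optimal_values` are hypothetical.

```python
import numpy as np

def absorbing_mdp(p, r, gamma, s, u):
    """Copy of (p, r) in which state s is absorbing with reward (1 - gamma) * u."""
    p_abs, r_abs = p.copy(), r.copy()
    p_abs[s] = 0.0
    p_abs[s, :, s] = 1.0            # every action at s loops back to s
    r_abs[s] = (1.0 - gamma) * u    # geometric sum of rewards at s equals u
    return p_abs, r_abs

def optimal_values(p, r, gamma, n_iter=2000):
    """Optimal state values by plain value iteration (fixed iteration count)."""
    v = np.zeros(r.shape[0])
    for _ in range(n_iter):
        v = (r + gamma * (p @ v)).max(axis=1)
    return v

# Toy instance (an assumption for illustration).
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

s, u = 2, 3.0
p_abs, r_abs = absorbing_mdp(p, r, gamma, s, u)
v_abs = optimal_values(p_abs, r_abs, gamma)
print(f"value at the absorbing state: {v_abs[s]:.6f}")
```

Under this reading, the modified MDP never uses the transition row of state s, so when it is built from the empirical model its value function does not depend on the samples drawn at s. That is what allows standard concentration arguments to be applied at s for a fixed u.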

This review was generated by AI and reviewed by a human editor.