QUICK REVIEW

[Paper Review] On the Optimality of Sparse Model-Based Planning for Markov Decision Processes.

Alekh Agarwal, Sham M. Kakade|arXiv (Cornell University)|Jun 10, 2019

Machine Learning and Algorithms|Citations: 13

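One-Line Summary

This paper establishes the minimax optimality of sparse model-based planning for discounted Markov decision processes with access to a generative model. By constructing a novel absorbing MDP, the authors prove that any high-accuracy policy in the empirical MDP built from N samples is also ϵ-optimal in the true MDP. This resolves a long-standing open question and shows that model-based methods can match the best non-asymptotic sample complexity achieved by model-free methods.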

ABSTRACT

This work considers the sample complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. In this work, we study the effectiveness of the most natural approach to model-based planning: we build the maximum likelihood estimate of the transition model from observations and then find an optimal policy in this empirical MDP. We ask arguably the most basic and unresolved question in model-based planning: is the naive plug-in approach, non-asymptotically, minimax optimal in the quality of the policy it finds, given a fixed sample size? With access to a generative model, we resolve this question in the strongest possible sense: our main result shows that \emph{any} high accuracy solution in the model constructed with $N$ samples provides an $\epsilon$-optimal policy in the true underlying MDP. In comparison, all prior (non-asymptotically) minimax optimal results use model-free approaches, such as the Variance Reduced Q-value iteration algorithm (Sidford et al. 2018), while the best known model-based results (e.g., Azar et al. 2013) require larger sample sizes in their dependence on the planning horizon or the state space. Notably, we show that the model-based approach allows the use of \emph{any} efficient planning algorithm in the empirical MDP, which simplifies the algorithm design as this approach does not tie the algorithm to the sampling procedure. The core of our analysis is a novel absorbing MDP construction to address the statistical dependency issues that arise in the analysis of model-based planning approaches, a construction which may be helpful more generally.

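Motivation and Objectives

Determine whether naive plug-in model-based planning is minimax optimal in the finite-sample setting.
Close the sample-complexity gap between model-based and model-free methods for obtaining an ϵ-optimal policy.
Show that any policy computed by an efficient planning algorithm in the empirical MDP built from N samples is also ϵ-optimal in the true MDP.
Overcome the statistical dependency issues in the analysis of model-based planning through a novel MDP construction.
Show that model-based planning can achieve the same non-asymptotic sample complexity as state-of-the-art model-free methods.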
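
To make the objectives above concrete, the following block writes out what ϵ-optimality means and the sample-complexity target usually quoted for this setting. The rate is recalled from the generative-model literature rather than stated in this review, so it should be read as an assumption: it holds up to logarithmic factors and only for a restricted range of $\epsilon$, and the symbols $|\mathcal{S}|$, $|\mathcal{A}|$, $\gamma$, and $N_{\text{total}}$ (state count, action count, discount factor, total number of oracle calls) are introduced here for illustration. $N$ is taken to be the number of samples per state-action pair, as in the method description.

\[
V^{\pi}(s) \;\ge\; V^{\star}(s) - \epsilon \quad \text{for all states } s,
\]
\[
N_{\text{total}} \;=\; |\mathcal{S}|\,|\mathcal{A}|\,N \;=\; \tilde{O}\!\left(\frac{|\mathcal{S}|\,|\mathcal{A}|}{(1-\gamma)^{3}\,\epsilon^{2}}\right).
\]

The cubic dependence on the effective horizon $1/(1-\gamma)$ is the quantity on which earlier model-based analyses are described as falling short.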

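Proposed Method

Construct an absorbing MDP to decouple the statistical dependencies that arise in the analysis of model-based planning.
Use the generative model to collect N samples for each state-action pair and build the maximum likelihood estimate of the transition model.
Apply any efficient planning algorithm to the empirical MDP to compute a policy.
Use new concentration inequalities to prove that an ϵ-optimal policy in the empirical MDP is also ϵ-optimal in the true MDP.
Use the absorbing MDP construction to bound how model-estimation error propagates into policy-performance error.
Establish minimax optimality by showing a sample-size dependence that matches the information-theoretic lower bound.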
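
The sampling and planning steps above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes a tabular MDP with a known reward table `R`, simulates the generative model by sampling from a true transition tensor `P`, and uses plain value iteration as the planner, which is only one of many planners that could be applied to the empirical MDP. The function names `sample_empirical_model` and `value_iteration` are hypothetical.

```python
import numpy as np

def sample_empirical_model(P, n_samples, rng):
    """Build the maximum likelihood transition estimate from oracle samples.

    P has shape (S, A, S). The true P is used here only to simulate the
    generative model; a real learner would only see the sampled next states.
    """
    S, A, _ = P.shape
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            # Query the oracle n_samples times at the pair (s, a).
            next_states = rng.choice(S, size=n_samples, p=P[s, a])
            counts = np.bincount(next_states, minlength=S)
            P_hat[s, a] = counts / n_samples  # empirical frequencies = MLE
    return P_hat

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Plan in the MDP (P, R); returns a greedy policy and its value estimate."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)       # Bellman backup, shape (S, A)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return Q.argmax(axis=1), V
```

A typical use is `P_hat = sample_empirical_model(P, N, np.random.default_rng(0))` followed by `policy, V_hat = value_iteration(P_hat, R, gamma)`. Because planning touches only `P_hat`, the planner can be swapped without changing the sampling step.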

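Results

Research Questions

RQ1: Is plug-in model-based planning minimax optimal in the finite-sample setting?
RQ2: Can model-based planning achieve the same non-asymptotic sample complexity as model-free methods?
RQ3: What statistical challenges arise in the analysis of model-based planning, and how can they be overcome?
RQ4: Does the empirical MDP built from N samples guarantee an ϵ-optimal policy in the true MDP?
RQ5: Can a general-purpose planning algorithm be applied to the empirical MDP without degrading sample complexity?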

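Key Findings

The proposed approach attains minimax-optimal sample complexity for obtaining an ϵ-optimal policy in discounted MDPs.
Any high-accuracy policy computed in the empirical MDP built from N samples is guaranteed to be ϵ-optimal in the true MDP.
The approach matches the best known non-asymptotic sample complexity of model-free methods such as Variance Reduced Q-value iteration.
The absorbing MDP construction resolves the statistical dependency issues in the analysis of model-based planning.
Any efficient planning algorithm can be used in the empirical MDP, which simplifies algorithm design.
The results show that model-based planning is not only information-theoretically optimal in the non-asymptotic regime but also effective in practice.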
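
The ϵ-optimality claims above refer to the gap between a policy's value and the optimal value in the true MDP. A minimal sketch of how that gap could be measured on a small tabular problem is shown below. It assumes the true transition tensor `P` (shape `(S, A, S)`) and reward table `R` (shape `(S, A)`) are available for evaluation, which holds only in simulation. The helper names are hypothetical and are not taken from the paper.

```python
import numpy as np

def policy_value(P, R, gamma, policy):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) V = R_pi."""
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, policy]   # transition matrix under the policy, shape (S, S)
    R_pi = R[idx, policy]   # reward vector under the policy, shape (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def optimal_value(P, R, gamma, n_iter=2000):
    """Approximate V* with a fixed number of value-iteration sweeps."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iter):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V

def suboptimality_gap(P, R, gamma, policy):
    """Largest shortfall max_s [V*(s) - V^pi(s)], evaluated in the true MDP."""
    gap = optimal_value(P, R, gamma) - policy_value(P, R, gamma, policy)
    return float(np.max(gap))
```

A policy planned in the empirical MDP is ϵ-optimal in the sense used here when `suboptimality_gap(P, R, gamma, policy)` is at most ϵ.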

This review was generated by AI and checked by a human editor.