QUICK REVIEW

[Paper Review] Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Bruno Scherrer|arXiv (Cornell University)|Jun 3, 2013

Reinforcement Learning in Robotics|References: 17|Citations: 32

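One-line summary

This paper presents improved and generalized upper bounds on the convergence complexity of policy iteration (PI) for Markov decision processes (MDPs). Howard's PI converges in $ O\big(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\big) $ iterations, and Simplex-PI converges in $ O\big(\frac{nm}{1-\gamma}\log\frac{1}{1-\gamma}\big) $ iterations, where $\gamma$ is the discount factor. Under structural assumptions on transient and recurrent states, bounds that do not depend on $\gamma$ are obtained, and strong polynomiality is extended to a broader class of MDPs.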

ABSTRACT

Given a Markov Decision Process (MDP) with $n$ states and a total number $m$ of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal $\gamma$-discounted policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most $O\left(\frac{m}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Hansen et al., while Simplex-PI terminates after at most $O\left(\frac{nm}{1-\gamma}\log\left(\frac{1}{1-\gamma}\right)\right)$ iterations, improving by a factor $O(\log n)$ a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor $\gamma$: quantities of interest are bounds $\tau_t$ and $\tau_r$ (uniform on all states and policies) respectively on the expected time spent in transient states and the inverse of the frequency of visits in recurrent states given that the process starts from the uniform distribution. Indeed, we show that Simplex-PI terminates after at most $\tilde O\left(n^3 m^2 \tau_t \tau_r\right)$ iterations. This extends a recent result for deterministic MDPs by Post & Ye, in which $\tau_t \le 1$ and $\tau_r \le n$, in particular it shows that Simplex-PI is strongly polynomial for a much larger class of MDPs. We explain why similar results seem hard to derive for Howard's PI. Finally, under the additional (restrictive) assumption that the state space is partitioned in two sets, respectively states that are transient and recurrent for all policies, we show that both Howard's PI and Simplex-PI terminate after at most $\tilde O(m(n^2\tau_t+n\tau_r))$ iterations.

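Motivation and objectives

To improve and generalize existing upper bounds on the number of iterations that policy iteration (PI) algorithms need to converge in MDPs.
To extend strongly polynomial convergence results beyond deterministic MDPs, to a class of MDPs characterized by structural properties of their transient and recurrent states.
To analyze the convergence behavior of two PI variants, Howard's PI and Simplex-PI, which differ in their policy update strategy.
To derive upper bounds that do not depend on the discount factor $\gamma$, using the structural quantities $\tau_t$ and $\tau_r$, which bound the expected time spent in transient states and the inverse of the visit frequency in recurrent states.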

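Proposed method

Analyzes Howard's PI, which at each iteration updates every state with a positive advantage, and Simplex-PI, which updates only the single state with maximal advantage.
Introduces and exploits a structural property of the MDP under which the state space is partitioned into a set of states that are transient for all policies ($\mathcal{T}$) and a set of states that are recurrent for all policies ($\mathcal{R}$).
Defines $\tau_t$ as a bound, uniform over all states and policies, on the expected time spent in transient states, and $\tau_r$ as a uniform bound on the inverse of the visit frequency in recurrent states when the process starts from the uniform distribution.
Derives contraction-type bounds on the progress of the policy values, using the Bellman operator and the dynamics of the value function.
Uses a variant of Cesàro means of stochastic matrices to bound the rate of progress of the value function.
Uses an iterative elimination argument: shows that at least one non-optimal action is eliminated every $ O(n\tau_r \log(n^2\tau_r)) $ iterations, which yields an iteration bound that holds up to logarithmic factors.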
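The two update rules described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes that every state has the same number `k` of actions (so that $m = nk$), a randomly generated MDP, exact policy evaluation by a linear solve, and a numerical tolerance `tol`. The names `evaluate`, `policy_iteration`, `P` and `r` are hypothetical.

```python
import numpy as np


def evaluate(P, r, policy, gamma):
    """Exact value of a deterministic policy: solve (I - gamma * P_pi) v = r_pi."""
    n = P.shape[0]
    states = np.arange(n)
    P_pi = P[states, policy]  # (n, n) transition matrix under the policy
    r_pi = r[states, policy]  # (n,) reward vector under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)


def policy_iteration(P, r, gamma, variant="howard", tol=1e-10, max_iter=10_000):
    """Run PI and return (policy, value, number of policy updates performed).

    P has shape (n, k, n) and r has shape (n, k); both are assumptions of this sketch.
    """
    n = P.shape[0]
    policy = np.zeros(n, dtype=int)  # arbitrary initial policy
    for updates in range(max_iter):
        v = evaluate(P, r, policy, gamma)
        q = r + gamma * (P @ v)          # (n, k) action values
        adv = q - v[:, None]             # advantage of each action over the policy
        best_action = adv.argmax(axis=1)
        best_adv = adv.max(axis=1)
        if best_adv.max() <= tol:        # no positive advantage: policy is optimal
            return policy, v, updates
        if variant == "howard":
            # Switch every state that has a positive advantage.
            policy = np.where(best_adv > tol, best_action, policy)
        elif variant == "simplex":
            # Switch only the state with maximal advantage.
            s = int(best_adv.argmax())
            policy = policy.copy()
            policy[s] = best_action[s]
        else:
            raise ValueError(f"unknown variant: {variant}")
    raise RuntimeError("policy iteration did not terminate within max_iter")


# Small random MDP, used only for illustration.
rng = np.random.default_rng(0)
n, k, gamma = 6, 3, 0.9
P = rng.random((n, k, n))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into probability distributions
r = rng.random((n, k))

pi_howard, v_howard, iters_howard = policy_iteration(P, r, gamma, "howard")
pi_simplex, v_simplex, iters_simplex = policy_iteration(P, r, gamma, "simplex")
print("Howard's PI updates:", iters_howard)
print("Simplex-PI updates:", iters_simplex)
```

Both variants should reach the same optimal value function, since it is unique; only the number of policy updates differs. On a single small instance these counts say nothing about the worst-case bounds, which concern the growth in $n$, $m$ and $\gamma$.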

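Results

Research questions

RQ1: Can the number of iterations of Howard's PI be bounded by $ O\big(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\big) $, improving on the previous bound by a factor of $ O(\log n) $?
RQ2: Can the convergence of Simplex-PI be bounded independently of the discount factor $\gamma$, using the structural MDP parameters $\tau_t$ and $\tau_r$?
RQ3: Can strongly polynomial convergence be extended beyond deterministic MDPs by using structural properties of transient and recurrent states?
RQ4: Why is it harder to derive similar structural bounds for Howard's PI than for Simplex-PI?
RQ5: Under a partition of the state space into two sets (transient and recurrent), can both PI variants be bounded by $ \tilde{O}(m(n^2\tau_t + n\tau_r)) $ iterations?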

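Key findings

Howard's PI converges in at most $ O\big(\frac{m}{1-\gamma}\log\frac{1}{1-\gamma}\big) $ iterations, improving the earlier result of Hansen et al. (2013) by a factor of $ O(\log n) $.
Simplex-PI converges in at most $ O\big(\frac{nm}{1-\gamma}\log\frac{1}{1-\gamma}\big) $ iterations, improving the earlier result of Ye (2011) by a factor of $ O(\log n) $.
Under the structural assumptions, Simplex-PI converges in $ \tilde{O}(n^3m^2\tau_t\tau_r) $ iterations. This extends the result of Post and Ye (2013) for deterministic MDPs to a much larger class.
For MDPs whose state space is partitioned into transient and recurrent states, both Howard's PI and Simplex-PI converge in $ \tilde{O}(m(n^2\tau_t + n\tau_r)) $ iterations, independently of $\gamma$.
Under the two-set assumption, the convergence rate of Howard's PI is bounded through a geometric decrease of the $\ell_1$ norm of the value gap, with contraction factor $ 1 - \frac{1}{n\tau_r} $.
The paper explains why similar structural bounds seem hard to derive for Howard's PI in general: it updates several states simultaneously, which complicates the analysis of action elimination and of progress tracking.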
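As a worked step, the contraction factor $1 - \frac{1}{n\tau_r}$ can be related to the elimination interval of $O(n\tau_r\log(n^2\tau_r))$ iterations quoted in the method section. The sketch below uses only the standard inequality $1 - x \le e^{-x}$. The iteration count $k$ and the target shrink factor $\frac{1}{n^2\tau_r}$ are notation assumed here for illustration, inferred from the argument of the logarithm rather than taken from the paper's proof:

$$
\left(1-\frac{1}{n\tau_r}\right)^{k} \;\le\; \exp\left(-\frac{k}{n\tau_r}\right) \;\le\; \frac{1}{n^2\tau_r}
\quad\text{whenever}\quad k \ge n\tau_r\log(n^2\tau_r).
$$

If each such block of iterations eliminates at least one of the $m$ actions, the resulting count is of order $m \, n\tau_r\log(n^2\tau_r) = \tilde{O}(mn\tau_r)$, which is consistent with the second term of $\tilde{O}(m(n^2\tau_t + n\tau_r))$. The transient term is not derived in this sketch. Likewise, substituting the values stated in the abstract for deterministic MDPs, $\tau_t \le 1$ and $\tau_r \le n$, into the Simplex-PI bound gives, with the logarithmic factors hidden by $\tilde{O}$ left untracked:

$$
n^3 m^2 \tau_t \tau_r \;\le\; n^3 m^2 \cdot 1 \cdot n \;=\; n^4 m^2 .
$$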

This review was generated by AI and checked by a human editor.