QUICK REVIEW

[Paper Review] Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

Ruosong Wang, Simon S. Du|arXiv (Cornell University)|May 1, 2020

Reinforcement Learning in Robotics|References: 36|Citations: 23

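One-Line Summary

This paper resolves a COLT 2018 open problem by proving that tabular, episodic reinforcement learning can be achieved with a sample complexity that scales only logarithmically, not polynomially as previously conjectured, with the planning horizon H. The authors introduce the Online Trajectory Synthesis algorithm and an ε-net construction for optimal policies, and show that when the cumulative reward of each episode is normalized to [0,1], long-horizon reinforcement learning is no more difficult than short-horizon reinforcement learning in a minimax sense.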

ABSTRACT

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to that of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.

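Motivation and Objectives

Resolve the COLT 2018 open problem of whether the sample complexity of long-horizon reinforcement learning must depend polynomially on the planning horizon H.
Challenge the common conjecture that long-horizon reinforcement learning is inherently harder than short-horizon reinforcement learning because of a polynomial dependence on H.
Develop a provably efficient algorithm for tabular, episodic reinforcement learning whose sample complexity depends only logarithmically on H.
Construct an ε-net for optimal policies whose log-covering number grows only logarithmically with H, enabling efficient policy evaluation.
Show that, when rewards are normalized to [0,1], long-horizon reinforcement learning is essentially as complex as contextual bandits (H=1).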

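Proposed Method

Proposes the Online Trajectory Synthesis algorithm, which adaptively evaluates every policy in a given class with a sample complexity that scales with the log-covering number of that class.
Constructs an ε-net for the set of optimal policies whose log-covering number grows only logarithmically with the planning horizon H.
Adopts a normalized-reward setting in which the cumulative reward per episode is bounded in [0,1], allowing a fair comparison across horizons.
Applies concentration inequalities and high-probability bounds to ensure that each policy's estimated value lies within ε of its true value with high probability.
Proves that the algorithm returns an ε-optimal policy with probability at least 1−δ using a number of episodes polynomial in |S|, |A|, log H, 1/ε, and log(1/δ).
Exploits the structure of episodic MDPs and the non-negativity of rewards to control estimation error and guarantee convergence to a near-optimal policy.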
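
The setting and guarantee listed above can be written compactly as follows. This is a restatement of what the review says, not a formula quoted from the paper; the symbols $r_h$, $V^{\pi}$, $\hat{\pi}$, and $N$ are notation assumed here for illustration and may differ from the authors' own.

```latex
% Normalized episodic setting (notation assumed for illustration)
r_h \ge 0, \qquad \sum_{h=1}^{H} r_h \le 1 \ \text{ in every episode}, \qquad
V^{\pi} = \mathbb{E}\!\left[\, \sum_{h=1}^{H} r_h \;\middle|\; \pi \right] \in [0,1].

% Guarantee as stated in the review: an \varepsilon-optimal policy
% with probability at least 1 - \delta
\Pr\!\left[\, V^{\hat{\pi}} \ge \max_{\pi} V^{\pi} - \varepsilon \,\right] \ge 1 - \delta
\quad \text{using} \quad
N = \mathrm{poly}\!\left( |S|,\ |A|,\ \log H,\ 1/\varepsilon,\ \log(1/\delta) \right)
\ \text{episodes}.
```

Because the per-episode total is capped at 1 regardless of H, the accuracy target ε means the same thing at every horizon, which is what makes the comparison across horizons fair.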

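Results

Research Questions

RQ1: Does the sample complexity of tabular, episodic reinforcement learning depend polynomially on the planning horizon H, as conjectured by Jiang and Agarwal (2018)?
RQ2: Can a provably efficient algorithm be designed whose sample complexity grows only logarithmically with H?
RQ3: When rewards are normalized to [0,1], is there an intrinsic difference in difficulty between long-horizon reinforcement learning and short-horizon reinforcement learning (e.g., contextual bandits)?
RQ4: Can an ε-net for optimal policies be constructed whose log-covering number grows only logarithmically with H?
RQ5: Can a minimax-optimal sample complexity for tabular, episodic reinforcement learning be achieved that is logarithmic, rather than polynomial, in H?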

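Key Findings

The sample complexity of the proposed Online Trajectory Synthesis algorithm grows only logarithmically with the planning horizon H.
The paper refutes the conjecture that long-horizon reinforcement learning is inherently harder than short-horizon reinforcement learning because of a polynomial dependence on H.
The log-covering number of the ε-net for optimal policies grows logarithmically with H, which enables efficient policy evaluation.
The algorithm returns an ε-optimal policy with probability at least 1−δ using O(poly(|S|, |A|, log H, 1/ε, log(1/δ))) episodes.
The result implies that, when rewards are normalized to [0,1], long-horizon reinforcement learning is no more difficult than short-horizon reinforcement learning in a minimax sense.
The authors conjecture that the minimax-optimal sample complexity of tabular, episodic reinforcement learning is Õ(|S||A|poly(log H)/ε²), which would imply that there is no horizon-dependent difficulty.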
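
To give a feel for why a log-covering number that grows only logarithmically with H matters, here is a minimal Python sketch based on the standard Hoeffding-plus-union-bound argument for returns in [0,1]. It is not the Online Trajectory Synthesis algorithm, and it only counts the returns needed per policy; the review credits the paper's algorithm with evaluating all policies at a cost that scales with the log-covering number. The function names and the specific form |S|·|A|·ln(H) for the log-covering number are assumptions made purely for illustration.

```python
import math


def episodes_per_policy(log_cover: float, eps: float, delta: float) -> int:
    """Returns needed per policy so that every policy in a finite class of
    size exp(log_cover) is estimated within eps, with probability >= 1 - delta.

    Standard Hoeffding + union bound for returns bounded in [0, 1]:
        n >= (log_cover + ln(2 / delta)) / (2 * eps**2)
    """
    if log_cover < 0 or not (0 < eps <= 1) or not (0 < delta < 1):
        raise ValueError("need log_cover >= 0, 0 < eps <= 1, 0 < delta < 1")
    return math.ceil((log_cover + math.log(2.0 / delta)) / (2.0 * eps ** 2))


def hypothetical_log_cover(num_states: int, num_actions: int, horizon: int) -> float:
    """ASSUMPTION (illustration only): log-covering number = |S| * |A| * ln(H).

    The review only states that this quantity grows logarithmically with H;
    the exact expression here is made up for the demonstration.
    """
    return num_states * num_actions * math.log(horizon)


if __name__ == "__main__":
    for horizon in (10, 100, 1_000, 10_000):
        log_cover = hypothetical_log_cover(5, 2, horizon)
        n = episodes_per_policy(log_cover, eps=0.1, delta=0.05)
        print(f"H={horizon:>6}  log-cover={log_cover:6.1f}  returns per policy={n}")
```

Under these assumed constants, increasing H from 10 to 10,000 raises the per-policy count by less than a factor of four, which is the qualitative behaviour a logarithmic dependence on the horizon predicts.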

This review was generated by AI and checked by a human editor.