QUICK REVIEW

[Paper Review] Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation

Christoph Dann, Yishay Mansour|arXiv (Cornell University)|Jun 19, 2022

Advanced Bandit Algorithms Research | Citations: 26

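One-Line Summary

This paper presents a framework and theory for myopic exploration with epsilon-greedy policies in episodic MDPs, defines the myopic exploration gap, and provides upper and lower bounds on sample complexity and regret under bounded Bellman Eluder dimension.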

ABSTRACT

Myopic exploration policies such as epsilon-greedy, softmax, or Gaussian noise fail to explore efficiently in some reinforcement learning tasks and yet, they perform well in many others. In fact, in practice, they are often selected as the top choices, due to their simplicity. But, for what tasks do such policies succeed? Can we give theoretical guarantees for their favorable performance? These crucial questions have been scarcely investigated, despite the prominent practical importance of these policies. This paper presents a theoretical analysis of such policies and provides the first regret and sample-complexity bounds for reinforcement learning with myopic exploration. Our results apply to value-function-based algorithms in episodic MDPs with bounded Bellman Eluder dimension. We propose a new complexity measure called myopic exploration gap, denoted by alpha, that captures a structural property of the MDP, the exploration policy and the given value function class. We show that the sample-complexity of myopic exploration scales quadratically with the inverse of this quantity, 1 / alpha^2. We further demonstrate through concrete examples that myopic exploration gap is indeed favorable in several tasks where myopic exploration succeeds, due to the corresponding dynamics and reward structure.

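Motivation and Objectives

Introduce a framework for analyzing value-function-based RL with myopic exploration.
Define the myopic exploration gap, which captures how easily a myopic policy can identify suboptimal functions.
Derive sample-complexity and regret bounds for epsilon-greedy RL with function approximation under bounded Bellman Eluder dimension.
Characterize the conditions under which myopic exploration is favorable, and provide lower bounds showing that the bounds are tight.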

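Proposed Method

Proposes a least-squares-regression-based algorithm (Algorithm 1) that collects data under a myopic exploration policy and learns Q-functions by backward induction.
Defines a new complexity measure, the myopic exploration gap alpha(f, F, Pi', expl, M), together with its radius c(f, F, Pi', expl, M).
Shows, through a structured analysis, how the gap is tied to the number of episodes needed to eliminate a suboptimal subset F', via the Bellman Eluder dimension d and covering numbers.
Provides the first regret and sample-complexity bounds for myopic exploration, expressed in terms of alpha and c.
Gives a general upper bound (Theorem 1) and a nearly matching lower bound, showing tight dependence on alpha and the dimension.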
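The data-collection and regression loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's Algorithm 1: it assumes a tabular Q-function with one-hot features, a hypothetical two-state deterministic toy MDP, and a refit after every episode. All function names (`epsilon_greedy_action`, `collect_episode`, `fit_q_backward`, `run`) are made up for this example.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def collect_episode(P, R, Q, epsilon, rng):
    """Roll out one episode of length H from state 0 under the epsilon-greedy policy."""
    H = Q.shape[0]
    s, episode = 0, []
    for h in range(H):
        a = epsilon_greedy_action(Q[h, s], epsilon, rng)
        s_next = int(rng.choice(P.shape[2], p=P[s, a]))
        episode.append((h, s, a, R[s, a], s_next))
        s = s_next
    return episode

def fit_q_backward(data, H, S, A):
    """Backward induction: regress Q_h onto r + max_a' Q_{h+1}(s', a') by least squares."""
    Q = np.zeros((H + 1, S, A))  # Q[H] = 0 is the value after the last step
    for h in reversed(range(H)):
        rows = [(s, a, r + Q[h + 1, s2].max()) for (hh, s, a, r, s2) in data if hh == h]
        if not rows:
            continue
        X = np.zeros((len(rows), S * A))  # one-hot features (assumed function class)
        y = np.zeros(len(rows))
        for i, (s, a, target) in enumerate(rows):
            X[i, s * A + a] = 1.0
            y[i] = target
        w, *_ = np.linalg.lstsq(X, y, rcond=None)  # unvisited pairs get weight 0
        Q[h] = w.reshape(S, A)
    return Q[:H]

def run(num_episodes=300, epsilon=0.5, H=3, seed=0):
    rng = np.random.default_rng(seed)
    S, A = 2, 2
    # Hypothetical toy MDP: action 0 leads to state 0, action 1 leads to state 1;
    # the only reward is 1 for taking action 1 in state 1.
    P = np.zeros((S, A, S))
    P[:, 0, 0] = 1.0
    P[:, 1, 1] = 1.0
    R = np.zeros((S, A))
    R[1, 1] = 1.0
    Q = np.zeros((H, S, A))
    data = []
    for _ in range(num_episodes):
        data.extend(collect_episode(P, R, Q, epsilon, rng))  # myopic exploration
        Q = fit_q_backward(data, H, S, A)                    # refit on all data so far
    return Q

if __name__ == "__main__":
    Q = run()
    print("Q-values at the first step in the start state:", Q[0, 0])
```

In this toy problem every state-action pair is reached with non-negligible probability under epsilon-greedy, which is the kind of situation where a large gap would be expected; the example says nothing about harder MDPs where myopic exploration fails.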

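Results

Research Questions

RQ1: Can myopic exploration policies (including epsilon-greedy) yield sample-efficient learning with function approximation?
RQ2: How does the myopic exploration gap quantify the difficulty of identifying suboptimal value functions under a myopic policy?
RQ3: What sample-complexity and regret guarantees hold for epsilon-greedy RL in episodic MDPs with bounded Bellman Eluder dimension?
RQ4: Under which structural conditions on the MDP (dynamics and rewards) is myopic exploration particularly effective?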

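Key Findings

A new myopic exploration gap alpha(f, F, Pi', expl, M) is defined, capturing how easily a myopic exploration strategy can identify that a candidate value function is suboptimal.
The paper proves a sample complexity of O((log c(F',F)) / alpha(F',F)^2 · H^2 · d · log factors), as an upper bound on the number of episodes played with a suboptimal f.
A nearly matching lower bound of Omega(d / alpha(F',F)^2) is established, showing tight dependence on the myopic exploration gap and the Bellman Eluder dimension.
Corollaries show that the gap is large, and learning is therefore fast, under favorable dynamics (e.g., small multiplicative variation across actions) or contextual bandit structure (delta_P = 1).
Dense reward shaping can either help or hinder myopic exploration, depending on how it changes which samples are informative under the exploration policy.
Potential-based reward shaping does not affect the gap in the tabular setting, because shaping changes the return of every policy by the same amount.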
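The last finding can be illustrated with the standard telescoping argument for potential-based shaping. The sketch below is an illustration under stated assumptions, not the paper's proof: it assumes an undiscounted, finite-horizon, deterministic toy MDP, a shaping term of the form Phi(s') - Phi(s), and a potential of zero at the end of the episode. The tables `NEXT`, `REWARD`, and `PHI` are made-up values.

```python
import itertools

H = 3  # episode length (assumed)

# Hypothetical deterministic toy MDP: NEXT[s][a] is the successor state.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 1}}
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5,
          (1, 1): 0.0, (2, 0): 2.0, (2, 1): 0.25}
# Arbitrary potential function over states (made-up values).
PHI = {0: 0.5, 1: -1.0, 2: 2.0}

def episode_return(actions, start=0, phi=None):
    """Return of an action sequence; if phi is given, add Phi(s') - Phi(s) per step."""
    s, total = start, 0.0
    for h, a in enumerate(actions):
        s_next = NEXT[s][a]
        r = REWARD[(s, a)]
        if phi is not None:
            # Assumption: the potential after the final step is defined as zero.
            phi_next = 0.0 if h == len(actions) - 1 else phi[s_next]
            r += phi_next - phi[s]
        total += r
        s = s_next
    return total

def return_shifts(start=0):
    """Shaped return minus original return, for every length-H action sequence."""
    shifts = {}
    for actions in itertools.product([0, 1], repeat=H):
        shaped = episode_return(actions, start, PHI)
        plain = episode_return(actions, start)
        shifts[actions] = shaped - plain
    return shifts

if __name__ == "__main__":
    for actions, shift in return_shifts().items():
        print(actions, shift)
```

Because the shaping terms telescope, every action sequence from the same start state is shifted by -Phi(start), so differences in return between policies are unchanged in this toy setting.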


This review was generated by AI and checked by a human editor.