QUICK REVIEW

[Paper Review] Learning Near Optimal Policies with Low Inherent Bellman Error

Andrea Zanette, Alessandro Lazaric|arXiv (Cornell University)|Feb 29, 2020

Advanced Bandit Algorithms Research | References: 47 | Citations: 38

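One-Sentence Summary

This paper introduces Eleanor, an optimistic LSVI-based algorithm for episodic RL with linear value function approximation under low inherent Bellman error. It proves a near-optimal regret bound together with a matching lower bound, and shows that for H=1 the algorithm reduces to LinUCB with an exploration parameter that handles misspecification.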

ABSTRACT

We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \mathcal{I} K)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.

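Motivation and Objectives

Enable efficient exploration with approximate linear action-value functions under low inherent Bellman error (IBE).
Clarify how the IBE condition relates to low-rank MDPs and to the LSPI conditions, and show that it applies more broadly.
Develop an optimistic, globally optimized LSVI-style algorithm (Eleanor) that preserves linearity of the Q-function.
Establish information-theoretically tight regret guarantees and draw out the implications for misspecified contextual linear settings.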

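Proposed Method

Define the inherent Bellman error (IBE) for a class of linear Q-functions and relate it to the linear and low-rank MDP frameworks.
Extend least-squares value iteration (LSVI) to the optimistic setting by solving a planning optimization program that chooses the parameters $\theta_t$ and the optimistic perturbations jointly across the entire horizon.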
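Introduce globally optimized perturbations $\bar{\xi}_t$ of the parameters $\theta_t$, subject to ellipsoidal constraints in parameter space, which preserves linearity and allows tight confidence bounds.
Derive a high-probability regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error and $\widetilde O$ hides logarithmic factors.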
Show that Eleanor reduces to LinUCB when H=1, and give a modified exploration parameter that handles misspecification.
Discuss computational considerations and the connection to misspecified contextual linear bandits.
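
For the H=1 case, the idea of a LinUCB-style rule with a misspecification-aware exploration parameter can be sketched as follows. This is a minimal illustration and not the paper's algorithm: the function names `select_action` and `update`, the parameters `alpha` and `eps`, and in particular the inflated radius `alpha + eps * sqrt(n)` are assumptions made for the example, since the review does not state the exact form of the exploration parameter. The inverse Gram matrix is maintained with the Sherman-Morrison rank-one update.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, v) for row in M]

def select_action(features, A_inv, b, alpha, eps, n):
    """Return the index of the arm with the largest optimistic estimate.

    features: list of feature vectors, one per arm.
    A_inv: inverse of the regularized Gram matrix (list of rows).
    b: sum of reward-weighted feature vectors.
    alpha: base confidence radius (assumed given).
    eps: assumed misspecification level; n: observations so far.
    """
    theta_hat = mat_vec(A_inv, b)  # ridge regression estimate
    # Hypothetical inflated radius; the exact form is an assumption.
    beta = alpha + eps * math.sqrt(n)
    scores = []
    for x in features:
        width = math.sqrt(dot(x, mat_vec(A_inv, x)))  # ||x|| in the A^{-1} norm
        scores.append(dot(x, theta_hat) + beta * width)
    return max(range(len(scores)), key=scores.__getitem__)

def update(A_inv, b, x, reward):
    """Sherman-Morrison update of A_inv and b after observing (x, reward)."""
    u = mat_vec(A_inv, x)  # A_inv is symmetric, so x^T A_inv equals u^T
    denom = 1.0 + dot(x, u)
    d = len(x)
    new_A_inv = [[A_inv[i][j] - u[i] * u[j] / denom for j in range(d)]
                 for i in range(d)]
    new_b = [bi + reward * xi for bi, xi in zip(b, x)]
    return new_A_inv, new_b
```

Setting `eps = 0` gives the usual well-specified bonus; a positive `eps` widens every confidence interval so that optimism still holds when rewards are only approximately linear.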

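Experimental Results

Research Questions

RQ1: In online episodic RL, can exploration be carried out effectively with a linear Q-function class under low inherent Bellman error?
RQ2: How does the inherent Bellman error relate to, and generalize beyond, the low-rank MDP and LSPI conditions?
RQ3: What regret guarantees hold for an optimistic LSVI-type algorithm that preserves linearity and handles misspecification?
RQ4: Does the proposed method recover known results (e.g., LinUCB) in the special case H=1, and how does misspecification affect the bound?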

Key Findings

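Eleanor achieves a regret bound of $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \mathcal{I} K)$, where $\mathcal{I}$ is the inherent Bellman error and $\widetilde O$ hides constants and polylogarithmic factors.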
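The inherent Bellman error framework is strictly more general than the low-rank MDP assumption, and it also handles misspecification, with the IBE amplified only by a factor of $\sqrt{d_t}$.
The result cannot be improved beyond constants and logarithmic factors, as shown by a matching lower bound for the setting without misspecification.
When H=1, Eleanor reduces to LinUCB with a modified exploration parameter that accounts for misspecification in contextual linear bandits.
The analysis extends to low-rank MDPs, improving on prior bounds by a factor of the square root of the feature dimension, and provides a systematic way to manage misspecification in the online setting.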
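
To make the trade-off between the two terms of the regret bound concrete, the sketch below evaluates them numerically with all constants and logarithmic factors dropped. The function names and the example values are illustrative assumptions, not quantities taken from the paper.

```python
import math

def regret_terms(dims, K, ibe):
    """Return (statistical term, misspecification term) of the bound.

    dims: list of feature dimensions d_t, one per timestep t = 1..H.
    K: number of episodes; ibe: inherent Bellman error.
    Constants and log factors are ignored.
    """
    statistical = sum(dims) * math.sqrt(K)
    misspecification = sum(math.sqrt(d) for d in dims) * ibe * K
    return statistical, misspecification

def crossover_episodes(dims, ibe):
    """Episode count K at which the two terms are equal (requires ibe > 0).

    Solves sum(d_t) * sqrt(K) = sum(sqrt(d_t)) * ibe * K for K.
    """
    ratio = sum(dims) / (ibe * sum(math.sqrt(d) for d in dims))
    return ratio ** 2
```

Under these simplifications, with H = 2, d_t = 4 at both steps, and an inherent Bellman error of 0.1, the two terms balance near K = 400. For larger K the misspecification term, which is linear in K, dominates the $\sqrt{K}$ statistical term.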

This review was generated by AI and checked by a human editor.