QUICK REVIEW

[Paper Review] Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Andrea Zanette, Emma Brunskill|arXiv (Cornell University)|Jan 1, 2019

Advanced Bandit Algorithms Research | References: 26 | Citations: 66

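One-Line Summary

This paper introduces Euler, an episodic finite-horizon RL algorithm that achieves problem-dependent regret bounds tied to the maximum conditional variance of next-state values while still matching the general worst-case bounds.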

ABSTRACT

Strong worst-case performance bounds for episodic reinforcement learning exist but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this we derive an algorithm for finite horizon discrete MDPs and associated analysis that both yields state-of-the-art worst-case regret bounds in the dominant terms and yields substantially tighter bounds if the RL environment has small environmental norm, which is a function of the variance of the next-state value functions. An important benefit of our algorithm is that it does not require a priori knowledge of a bound on the environmental norm. As a result of our analysis, we also help address an open learning theory question (Jiang and Agarwal, 2018) about episodic MDPs with a constant upper-bound on the sum of rewards, providing a regret bound with no $H$-dependence in the leading term that scales as a polynomial function of the number of episodes.

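Motivation and Objectives

Motivates the need for problem-dependent regret bounds in reinforcement learning, in order to understand problem difficulty beyond worst-case analysis.
Proposes an algorithm (Euler) that adapts its exploration using variance-aware bonuses, without prior knowledge of the environment.
Derives high-probability regret bounds that depend on the environment's variance ($Q^*$), and shows horizon-independent behavior in specific bounded-reward settings.
Demonstrates that the approach yields tighter bounds in regimes where the environmental norm is small, and addresses an open problem in learning theory.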
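To give a feel for the variance quantity these objectives refer to, the sketch below computes the largest variance of the optimal next-state value over all state, action, and time-step triples in a small tabular MDP. It is an illustration under simplifying assumptions, not the paper's definition in full: rewards are taken to be deterministic (so no reward-variance term appears), and the function names and the two-state toy chain are hypothetical.

```python
def optimal_values(P, R, H):
    """Backward induction for a tabular finite-horizon MDP.

    P[s][a] is a list of next-state probabilities and R[s][a] a deterministic
    reward (an assumption of this sketch). Returns V, where V[t][s] is the
    optimal value at step t and V[H] is all zeros.
    """
    S = len(P)
    V = [[0.0] * S for _ in range(H + 1)]
    for t in range(H - 1, -1, -1):
        for s in range(S):
            V[t][s] = max(
                R[s][a] + sum(p * v for p, v in zip(P[s][a], V[t + 1]))
                for a in range(len(P[s]))
            )
    return V


def max_next_value_variance(P, V):
    """Largest variance of V[t + 1][s'] under P[s][a], over all (s, a, t)."""
    worst = 0.0
    for t in range(len(V) - 1):
        nxt = V[t + 1]
        for s in range(len(P)):
            for probs in P[s]:
                mean = sum(p * v for p, v in zip(probs, nxt))
                var = sum(p * (v - mean) ** 2 for p, v in zip(probs, nxt))
                worst = max(worst, var)
    return worst


# Hypothetical toy chain with one action per state: state 0 moves to state 1
# with probability 0.5; state 1 is absorbing and pays reward 1 per step.
P = [[[0.5, 0.5]], [[0.0, 1.0]]]
R = [[0.0], [1.0]]
V = optimal_values(P, R, H=3)
q_star = max_next_value_variance(P, V)
print(q_star)  # 0.5625 for this toy chain
```

If the transition out of state 0 is made deterministic, the same computation returns zero, which is the kind of low-variance regime where a variance-dependent bound is expected to help most.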

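Proposed Method

Introduces Euler, an episodic upper-lower exploration algorithm for finite-horizon MDPs.
Applies optimism under uncertainty using Bernstein-type bonuses based on the empirical variance of next-state values.
Incorporates a correction bonus that accounts for uncertainty in the value function, which guarantees optimism.
Analyzes regret by decomposing it into reward estimation, transition-dynamics estimation/optimism, and lower-order terms.
Bounds the dominant exploration term by the problem-dependent quantity $Q^*$ and relates it to the maximum return $G$.
Proves a worst-case bound that matches the known $\tilde{O}(\sqrt{HSAT})$ rate in the dominant term.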
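The variance-aware bonus described above can be sketched as a generic empirical-Bernstein bonus for a single state-action pair. This is a minimal illustration and not Euler's exact bonus: the constants follow the standard empirical Bernstein form and are assumptions here, the correction term for value-function uncertainty is omitted, and the function name and parameters (`next_values`, `value_range`, `delta`) are hypothetical.

```python
import math


def empirical_bernstein_bonus(next_values, value_range, delta=0.05):
    """Bernstein-style exploration bonus from sampled next-state values.

    next_values: estimated values of the next states observed so far for one
                 state-action pair.
    value_range: an upper bound on the spread of those values.
    delta:       failure probability of the confidence interval.
    Constants are illustrative assumptions, not those used in the paper.
    """
    n = len(next_values)
    if n == 0:
        # No data yet: fall back to the largest possible bonus.
        return value_range
    mean = sum(next_values) / n
    variance = sum((v - mean) ** 2 for v in next_values) / n
    log_term = math.log(2.0 / delta)
    # Variance-dependent term: shrinks as 1/sqrt(n) and vanishes when the
    # observed next-state values all agree.
    variance_term = math.sqrt(2.0 * variance * log_term / n)
    # Range-dependent term: shrinks faster, as 1/n.
    range_term = 7.0 * value_range * log_term / (3.0 * max(n - 1, 1))
    return variance_term + range_term


# Same sample size, different spread of next-state values.
low_variance = empirical_bernstein_bonus([1.0] * 10, value_range=2.0)
high_variance = empirical_bernstein_bonus([0.0, 2.0] * 5, value_range=2.0)
more_samples = empirical_bernstein_bonus([0.0, 2.0] * 50, value_range=2.0)
```

With equal sample counts, the pair whose next-state values agree receives the smaller bonus, so exploration effort concentrates where the next-state value is actually uncertain. A Hoeffding-style bonus would depend only on `value_range` and could not make this distinction.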

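Experimental Results

Research Questions

RQ1: Is it possible to obtain regret bounds for episodic finite-horizon MDPs that depend on problem structure rather than purely on the worst case?
RQ2: Do exploration bonuses based on empirical Bernstein inequalities and value-function uncertainty yield tighter, environment-dependent regret bounds without prior domain knowledge?
RQ3: How do the horizon and the environmental norm affect regret bounds in finite-horizon RL?
RQ4: Can the proposed method address the open problem concerning horizon dependence in episodic MDPs with bounded total reward?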

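Key Findings

With high probability, Euler achieves a problem-dependent regret upper bound of $\tilde{O}\left(\sqrt{Q^* SAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\right)$.
A second bound, $\tilde{O}\left(\sqrt{\tfrac{G^2}{H} SAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\right)$, is also provided and often tightens the first; its leading term scales linearly with the maximum return $G$, so it is strongest when $G$ is small.
Follow-up corollaries: in specific bounded-reward settings the bound shows horizon-independent behavior, and it matches the minimax bound in the dominant term.
Corollary 1.1 states the worst-case bound $\tilde{O}\left(\sqrt{HSAT} + \sqrt{S}\,SAH^2(\sqrt{S}+\sqrt{H})\right)$.
Corollary 1.2 gives a bound in terms of the range of successor-state values, $\Phi_{succ}$, which does not depend on the full $V^*$ and does not require knowledge of $\Phi$ or the environmental norm.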
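To make the relationship between the leading terms of these bounds concrete, the sketch below evaluates them numerically. It drops logarithmic factors, constants, and the shared lower-order term, so it compares rates only. The function name and the example values of S, A, H, T, $Q^*$, and $G$ are hypothetical.

```python
import math


def leading_terms(S, A, H, T, q_star, G):
    """Leading terms of the three regret bounds, ignoring logs and constants."""
    return {
        "variance": math.sqrt(q_star * S * A * T),      # sqrt(Q* S A T)
        "return": math.sqrt(G ** 2 / H * S * A * T),    # sqrt(G^2 / H * S A T)
        "worst_case": math.sqrt(H * S * A * T),         # sqrt(H S A T)
    }


# Hypothetical setting: total reward per episode bounded by a constant G = 1.
bounded_return = leading_terms(S=10, A=5, H=100, T=1_000_000, q_star=0.5, G=1.0)

# With G = H the return-based term reduces to the worst-case rate.
full_return = leading_terms(S=10, A=5, H=100, T=1_000_000, q_star=0.5, G=100.0)

for name, value in bounded_return.items():
    print(f"{name}: {value:.1f}")
```

In the bounded-return setting the return-based term falls well below the worst-case rate, and substituting $G = H$ recovers $\sqrt{HSAT}$ exactly, which is how the problem-dependent bound stays consistent with the worst-case one.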


This review was generated by AI and checked by a human editor.