QUICK REVIEW

[Paper Review] Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

Mohammad Sadegh Talebi, Odalric-Ambrym Maillard | arXiv (Cornell University) | Mar 5, 2018

Reinforcement Learning in Robotics | Citations: 19

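One-Sentence Summary

This paper presents variance-aware regret bounds for undiscounted reinforcement learning in MDPs, obtained through a new analysis of the KL-UCRL algorithm that replaces the conventional diameter-based bound with variance-dependent terms. The main result is a high-probability regret bound of $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, which improves on earlier bounds by exploiting the local variance of the bias function rather than the MDP diameter and the number of actions.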

ABSTRACT

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde{\mathcal{O}}\Bigl({\textstyle \sqrt{S\sum_{s,a}\mathbf{V}^\star_{s,a}T}}\Bigr)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.

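Motivation and Objectives

Improve regret bounds for undiscounted reinforcement learning under the average-reward criterion by replacing the MDP diameter with the local variance of the bias function.
Provide a tighter high-probability regret bound for the KL-UCRL algorithm in ergodic MDPs through a variance-aware analysis.
Establish a new regret bound that scales with the sum of state-action variances rather than with the number of actions or the diameter.
Develop new variants of the transportation lemma and apply KL concentration inequalities to sharpen the analysis in MDPs.
Show whether the new bound yields an order-of-magnitude improvement over the classical bound in specific benchmark MDPs.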

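Proposed Method

Revisit the minimax lower bound for undiscounted RL, with the local variance of the bias function in place of the MDP diameter.
Propose a new analysis of the KL-UCRL algorithm based on variance-aware concentration inequalities and variants of the transportation lemma.
Derive a high-probability regret bound $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$ that scales with the sum of state-action variances, where $\mathbf{V}^\star_{s,a}$ is the variance of the bias function under the next-state distribution when action $a$ is taken in state $s$.
Use the Bellman optimality equation and a decomposition of the bias function to relate the regret to the sub-optimality gaps and the state visit counts.
Apply the Azuma-Hoeffding inequality and Kullback-Leibler concentration inequalities to control deviations of the value estimates.
Derive a regret decomposition that separates the bias term from the sub-optimality gaps, which makes variance-based control possible.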
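
The quantity $\mathbf{V}^\star_{s,a}$ is described above only in words, so a small numerical illustration may help. The sketch below assumes the usual reading of that description: for a fixed pair $(s,a)$ with next-state distribution $p(\cdot \mid s,a)$ and bias function $b^\star$, the variance is $\sum_y p(y \mid s,a)\bigl(b^\star(y) - \sum_z p(z \mid s,a)\, b^\star(z)\bigr)^2$. The symbols $p$ and $b^\star$, the function name `local_bias_variance`, and the three-state bias values are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def local_bias_variance(next_state_probs: Sequence[float], bias: Sequence[float]) -> float:
    """Variance of `bias` under the next-state distribution of one (s, a) pair."""
    if len(next_state_probs) != len(bias):
        raise ValueError("distribution and bias must have the same length")
    if abs(sum(next_state_probs) - 1.0) > 1e-9:
        raise ValueError("next_state_probs must sum to 1")
    # Expected bias of the next state.
    mean = sum(p * b for p, b in zip(next_state_probs, bias))
    # Probability-weighted squared deviation from that expectation.
    return sum(p * (b - mean) ** 2 for p, b in zip(next_state_probs, bias))


# Hypothetical 3-state example: the bias values are made up for illustration.
bias = [0.0, 1.0, 4.0]
deterministic = [0.0, 1.0, 0.0]  # always moves to state 1
spread_out = [0.5, 0.0, 0.5]     # moves to state 0 or 2 with equal probability

print(local_bias_variance(deterministic, bias))  # 0.0
print(local_bias_variance(spread_out, bias))     # 4.0
```

A deterministic transition contributes zero variance however large the bias values are, which hints at why a variance-based bound can be much smaller than a diameter-based one when the dynamics are nearly deterministic.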

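Experimental Results

Research Questions

RQ1: Can replacing the MDP diameter with the local variance of the bias function improve regret bounds for undiscounted RL?
RQ2: Does the KL-UCRL algorithm achieve a tighter regret bound under a variance-aware analysis?
RQ3: How does the new variance-dependent regret bound differ from the classical bound $\widetilde{\mathcal{O}}(DS\sqrt{AT})$ in scaling and in practical performance?
RQ4: Do the new transportation lemma variants and KL concentration inequalities lead to tighter bounds in MDPs with low variance?
RQ5: For which MDP structures does the variance-aware bound give a marked improvement over the diameter-based bound?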

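Key Findings

The proposed regret bound is $\widetilde{\mathcal{O}}\left(\sqrt{S\sum_{s,a} \mathbf{V}^\star_{s,a}T}\right)$, with the state-action variances taking the place of the diameter $D$ and the number of actions $A$.
It improves on the best previously known bound for KL-UCRL, $\widetilde{\mathcal{O}}(DS\sqrt{AT})$, and has no explicit dependence on $D$ or $A$.
In benchmark MDPs, the low-variance terms can yield an order-of-magnitude improvement over the classical bound in some cases.
The work introduces new variants of the transportation lemma and KL concentration techniques that are of independent interest.
The effective regret is bounded by $D + \sum_{s,a} \mathbb{E}[N_T(s,a)] \varphi(s,a)$, where $N_T(s,a)$ counts the visits to $(s,a)$ up to time $T$ and $\varphi(s,a)$ is the sub-optimality gap; the bias term is controlled by the variance-aware bounds.
The bound holds with high probability and is derived under the assumption of an ergodic MDP with finite state and action spaces.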
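
To make the comparison between the two bounds concrete, the leading terms can be evaluated side by side. The following is a rough sketch only: it drops the constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$, and every number in it (`S`, `A`, `D`, `T`, and the per-pair variance of 0.5) is a hypothetical value chosen for illustration, not a figure from the paper's benchmarks.

```python
import math
from typing import Iterable


def variance_aware_leading_term(num_states: int, variances: Iterable[float], horizon: int) -> float:
    """sqrt(S * sum_{s,a} V_{s,a} * T), ignoring constants and log factors."""
    return math.sqrt(num_states * sum(variances) * horizon)


def diameter_leading_term(diameter: float, num_states: int, num_actions: int, horizon: int) -> float:
    """D * S * sqrt(A * T), ignoring constants and log factors."""
    return diameter * num_states * math.sqrt(num_actions * horizon)


# Hypothetical MDP: all numbers below are illustrative, not from the paper.
S, A, D, T = 10, 2, 20.0, 1_000_000
variances = [0.5] * (S * A)  # assumed local variance for each (s, a) pair

new_term = variance_aware_leading_term(S, variances, T)
old_term = diameter_leading_term(D, S, A, T)
print(f"variance-aware: {new_term:.0f}")
print(f"diameter-based: {old_term:.0f}")
print(f"ratio old/new:  {old_term / new_term:.1f}")
```

Because the horizon $T$ enters both terms as $\sqrt{T}$, the ratio does not depend on $T$; under these assumptions it reduces to $D\sqrt{SA / \sum_{s,a}\mathbf{V}_{s,a}}$, so the gain comes entirely from how small the summed variances are relative to $D^2 SA$.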

This review was generated by AI and checked by a human editor.