QUICK REVIEW

[Paper Review] Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon

Zihan Zhang, Xiangyang Ji|arXiv (Cornell University)|Sep 28, 2020

Advanced Bandit Algorithms Research | References: 46 | Citations: 30

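One-Line Summary

This paper proposes MVP (Monotonic Value Propagation), an algorithm with a Bernstein-type bonus. For episodic RL it achieves nearly bandit-level sample complexity: its regret approaches the contextual bandit lower bound and depends only logarithmically on the horizon H.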

ABSTRACT

Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider the episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We propose a new algorithm, Monotonic Value Propagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In particular, the constants in the bonus should be subtly set to ensure optimism and monotonicity. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \mathrm{polylog}\left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of contextual bandits up to logarithmic terms. Notably, this result 1) exponentially improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) exponentially improves the running time in [Wang et al. 2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.

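Motivation and Objectives

Assess whether episodic RL can match the sample efficiency of contextual bandits (CB) when the total reward is bounded.
Develop a computationally efficient algorithm whose bounds depend only logarithmically on the horizon H.
Introduce a Bernstein-type exploration bonus that guarantees optimism and monotonic value propagation.
Provide theoretical guarantees: regret and PAC bounds that are close to the CB lower bound, up to logarithmic factors.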

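Proposed Method

Propose Monotonic Value Propagation (MVP), a UCB-based, model-based algorithm with a new Bernstein-type bonus.
Define Q_h(s,a) = hat{r}(s,a) + hat{P}_{s,a} V_{h+1} + b_h(s,a), where the bonus b_h(s,a) is chosen to guarantee optimism.
Introduce a monotonicity property: Q_h is non-decreasing in V_{h+1}, which lets optimism propagate from step to step without incurring horizon dependence.
Use a trigger-based doubling update framework that refreshes the reward and transition estimates and propagates them across episodes.
Derive a recursive variance-bounding technique that controls the variance over the whole horizon through a higher-order moment expansion.
Establish regret and PAC bounds: Regret(K) = O((sqrt(SAK) + S^2A) polylog(SAHK/δ)), and a PAC-RL bound of O(SA/ε^2 + S^2A/ε) episodes up to polylog factors.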
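The update Q_h(s,a) = hat{r}(s,a) + hat{P}_{s,a} V_{h+1} + b_h(s,a) can be sketched as a backward pass over the horizon. The following is a minimal illustration, not the paper's implementation: the bonus constants c1, c2, c3, the log term iota, the clipping of Q at 1 (motivated by the total-reward bound), and the power-of-two trigger rule are all assumptions made for this example, and the function names are hypothetical. The paper chooses its constants carefully to obtain optimism and monotonicity; the placeholder values here carry no such guarantee.

```python
import numpy as np

def bernstein_bonus(p_hat, v_next, r_hat, n, iota, c1=1.0, c2=1.0, c3=1.0):
    """Bernstein-style bonus for one (s, a) pair.

    c1, c2, c3 are placeholder constants (assumption), not the paper's values.
    """
    n = max(n, 1)  # avoid division by zero for unvisited pairs
    mean = p_hat @ v_next
    # Empirical variance of V_{h+1} under the estimated transition row.
    var = max(p_hat @ (v_next ** 2) - mean ** 2, 0.0)
    return (c1 * np.sqrt(var * iota / n)
            + c2 * np.sqrt(r_hat * iota / n)
            + c3 * iota / n)

def optimistic_planning(P_hat, R_hat, N, H, iota):
    """Backward induction with an exploration bonus.

    P_hat: (S, A, S) estimated transitions, R_hat: (S, A) estimated rewards,
    N: (S, A) visit counts. Steps are indexed 1..H and V[H+1] = 0.
    """
    S, A = R_hat.shape
    V = np.zeros((H + 2, S))
    Q = np.zeros((H + 1, S, A))  # Q[0] is unused
    for h in range(H, 0, -1):
        for s in range(S):
            for a in range(A):
                b = bernstein_bonus(P_hat[s, a], V[h + 1], R_hat[s, a],
                                    N[s, a], iota)
                # Clip at 1 because the total reward is assumed bounded by 1.
                Q[h, s, a] = min(R_hat[s, a] + P_hat[s, a] @ V[h + 1] + b, 1.0)
        V[h] = Q[h].max(axis=1)
    return Q, V

def is_trigger(n):
    """Doubling rule (assumed): re-plan when a visit count hits a power of two."""
    return n > 0 and (n & (n - 1)) == 0
```

In an episode loop, one would increment N[s, a] after each step, and call optimistic_planning again only when is_trigger(N[s, a]) holds, so each (s, a) pair causes a number of re-plans that is logarithmic in its visit count.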

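Results

Research Questions

RQ1 Under bounded total reward, does episodic RL require more samples than contextual bandits?
RQ2 Can a computationally efficient algorithm be designed whose regret and PAC guarantees match the CB lower bound up to logarithmic factors?
RQ3 Can the horizon dependence be kept logarithmic rather than polynomial while retaining near-optimal sample complexity?
RQ4 What structure of exploration bonus guarantees optimism and monotonic value propagation over the whole horizon?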

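Key Findings

MVP achieves Regret(K) = O((sqrt(SAK) + S^2A) polylog(SAHK)) with high probability.
By a standard reduction, an ε-suboptimal policy is found within O((SA/ε^2 + S^2A/ε) polylog(SAH/(εδ))) episodes.
The algorithm is computationally efficient (polynomial time) and keeps the dependence on H in the bounds logarithmic.
Combining the new Bernstein-type bonus with the monotonicity property tightens the optimism needed for near-bandit performance.
The results substantially narrow the gap between RL and CB, improving on prior polynomial-time algorithms in the scaling with S, A, K and in the H dependence.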
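To get a feel for the two bounds, the sketch below evaluates only their leading terms, dropping all constants and polylog factors (an assumption that makes the numbers indicative of scaling only, not of actual regret). The function names are hypothetical. It also shows when the sqrt(SAK) term overtakes the S^2A term: sqrt(SAK) >= S^2A exactly when K >= S^3A.

```python
import math

def regret_leading_term(S, A, K):
    """sqrt(SAK) + S^2 A, ignoring constants and polylog factors."""
    return math.sqrt(S * A * K) + S**2 * A

def pac_leading_term(S, A, eps):
    """SA/eps^2 + S^2 A/eps episodes, ignoring constants and polylog factors."""
    return S * A / eps**2 + S**2 * A / eps

def episodes_until_sqrt_term_dominates(S, A):
    """Smallest K with sqrt(SAK) >= S^2 A, i.e. K = S^3 A."""
    return S**3 * A

if __name__ == "__main__":
    S, A = 10, 5
    K = episodes_until_sqrt_term_dominates(S, A)
    print(K, regret_leading_term(S, A, K), pac_leading_term(S, A, 0.1))
```

Note that H does not appear in any of these leading terms; in the stated bounds it enters only through the polylog factor omitted here.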


This review was generated by AI and checked by a human editor.