QUICK REVIEW

[Paper Review] Model-Based Reinforcement Learning with Value-Targeted Regression

Alex Ayoub, Zeyu Jia|arXiv (Cornell University)|Jun 1, 2020

Advanced Bandit Algorithms Research|References 44|Citations 71

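One-Line Summary

This paper introduces UCRL-VTR, a model-based RL algorithm that builds confidence sets via value-targeted regression and then plans optimistically over them. It achieves regret bounds that scale with the complexity of the model class rather than with the size of the state and action spaces, including an explicit bound for linear mixture models.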

ABSTRACT

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_{\theta} = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are 'consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting values, as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).

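Motivation and Objectives

Motivates regret minimization in online model-based RL when the transition model belongs to a known family P.
Proposes value-targeted regression to construct data-consistent confidence sets over P.
Develops UCRL-VTR, an optimistic-planning algorithm that uses these confidence sets.
Provides theoretical regret bounds and an empirical evaluation.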

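Proposed Method

Defines an episodic MDP with a known model family P and considers the linear mixture model P_θ = sum_{i=1}^{d} θ_i P_i.
Introduces value-targeted regression, which forms a regression loss L_{k+1}(P, P̂_{k+1}) from the predicted values V_{h+1,k} and the observed targets y_{h,k}.
Builds the confidence set from the regression loss: B_{k+1} = {P' ∈ P : L_{k+1}(P', P̂_{k+1}) ≤ β_{k+1}}.
In each episode, performs optimistic planning over B_k to select the model P_k that maximizes V^{*}_{P',1}(s_1^k) over P' ∈ B_k, then executes the induced policy and updates the value targets.
Provides regret bounds in terms of the Eluder dimension and covering numbers; specialized to linear mixtures, this gives R_K = Õ(d √(H^3 T)) with T = KH total steps, alongside a lower bound of Ω(√(HdT)).
Discusses implementation considerations and the relationship to MuZero.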
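The regression step above can be sketched for the linear mixture case. The following is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the randomly drawn basis kernels, the uniformly random policy, the random stand-in value functions (in UCRL-VTR they would come from optimistic planning, which is omitted here), the ridge parameter `lam`, and the threshold `beta` are all hypothetical. It shows that regressing the observed target y = V(s') on the features x_i = <P_i(. | s, a), V> recovers the mixing weights θ, and that in this linear setting a loss-based confidence set can be written as an ellipsoid around the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): states, actions, mixture dimension, horizon.
S, A, d, H = 5, 2, 3, 4

# Hypothetical basis kernels P_i(s' | s, a); shape (d, S, A, S).
basis = rng.dirichlet(np.ones(S), size=(d, S, A))
theta_true = np.array([0.6, 0.3, 0.1])  # mixing weights, unknown to the learner
P_true = np.einsum("i,isat->sat", theta_true, basis)  # true kernel, shape (S, A, S)


def vtr_features(s, a, V):
    """x_i = <P_i(. | s, a), V>: expected next-state value under each basis model."""
    return basis[:, s, a, :] @ V


def run_vtr(num_steps, lam=1.0):
    """Ridge regression of value targets on value-weighted features."""
    M = lam * np.eye(d)   # regularized Gram matrix
    w = np.zeros(d)       # running sum of target * feature
    s = 0
    for _ in range(num_steps):
        a = rng.integers(A)                      # assumption: uniformly random policy
        V = rng.uniform(0.0, H, size=S)          # assumption: stand-in for V_{h+1,k}
        s_next = rng.choice(S, p=P_true[s, a])   # sample the real transition
        x = vtr_features(s, a, V)
        y = V[s_next]                            # value target y_{h,k}
        M += np.outer(x, x)
        w += y * x
        s = s_next
    return np.linalg.solve(M, w), M


def in_confidence_set(theta, theta_hat, M, beta):
    """Ellipsoid test (theta - theta_hat)^T M (theta - theta_hat) <= beta; beta is hypothetical."""
    diff = theta - theta_hat
    return float(diff @ M @ diff) <= beta


theta_hat, M = run_vtr(20000)
print("estimated theta:", np.round(theta_hat, 3))
print("true theta:     ", theta_true)
```

Note that the regression never fits the full next-state distribution: each transition contributes a single scalar target, so the estimate is driven only by the directions of the model that matter for the value functions encountered.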

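Experimental Results

Research Questions

RQ1: Can value-targeted regression yield sublinear regret in model-based RL across general model classes P?
RQ2: How do the regret bounds depend on the complexity of P (e.g., its Eluder dimension) and on the noise and non-stationarity of the value targets?
RQ3: What are the advantages and limitations of optimistic planning with value-targeted confidence sets compared with conventional model-based methods?
RQ4: How does the regret scale when the analysis is specialized to linear mixture models?
RQ5: How does the method compare empirically with other model-based RL methods and with variants of value-targeted regression?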

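Key Findings

For linear mixture models, the algorithm achieves a regret bound of Õ(d √(H^3 T)).
In the general model-class setting, regret is bounded via the Eluder dimension of the function class defined by the value targets.
The upper bound does not depend on the size of the state or action space and, in the linear case, is close to the lower bound Ω(√(HdT)).
Value-targeted regression focuses model learning on task-relevant dynamics and may be more efficient than likelihood-based regression.
In experiments, value-targeted regression combined with optimistic planning is effective; removing either the optimism or the value-targeted regression degrades performance.
The work is related to MuZero, which independently uses value-targeted regression for model building.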
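To make "close to the lower bound" concrete, the ratio of the two linear-mixture bounds quoted above can be worked out directly, ignoring the logarithmic factors hidden in the Õ notation:

```latex
\frac{d\sqrt{H^{3}T}}{\sqrt{HdT}}
  = \frac{d\,H^{3/2}\,T^{1/2}}{H^{1/2}\,d^{1/2}\,T^{1/2}}
  = H\sqrt{d}
```

The two bounds therefore agree in their dependence on T and differ by a factor of H√d, up to logarithmic terms.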


This review was generated by AI and checked by a human editor.