QUICK REVIEW

[Paper Review] Provably Efficient Reinforcement Learning with Linear Function Approximation

Chi Jin, Zhuoran Yang|arXiv (Cornell University)|Jul 11, 2019

Advanced Bandit Algorithms Research | Cited by 219

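One-Line Summary

The paper presents the first RL algorithm for the linear MDP setting that is provably efficient, with both polynomial runtime and polynomial sample complexity. Its regret is Õ(√(d^3 H^3 T)), independent of the number of states and actions. The algorithm is an optimistic version of LSVI with a UCB bonus, and it remains robust under small model misspecification.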

ABSTRACT

Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.

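Motivation and Objectives

Motivates the design of provably efficient RL algorithms with function approximation that do not rely on a simulator or strong additional assumptions.
Studies the linear MDP setting, in which transitions and rewards are linear in a feature map, and establishes regret and sample-complexity guarantees.
Develops and analyzes an algorithm whose sublinear regret does not depend on the size of the state or action space.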
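
For readers who want the setting in symbols, the following is a sketch of the standard linear MDP definition and of the regret being bounded. The symbols μ_h, θ_h, and π_k do not appear in this review and are introduced here as notation only; d, H, T, and φ are as in the abstract.

```latex
% Linear MDP: transitions and rewards are linear in the feature map \phi(x,a) \in \mathbb{R}^d
\mathbb{P}_h(x' \mid x, a) = \langle \phi(x,a), \mu_h(x') \rangle,
\qquad
r_h(x,a) = \langle \phi(x,a), \theta_h \rangle
\\[4pt]
% Regret over K episodes of length H, with T = KH total steps
\mathrm{Regret}(K) = \sum_{k=1}^{K} \Bigl[ V_1^{*}(x_1^k) - V_1^{\pi_k}(x_1^k) \Bigr],
\qquad T = KH
```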

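Proposed Method

Adopts an optimistic modification of Least-Squares Value Iteration (LSVI) based on Upper-Confidence Bounds (UCB).
Represents Q_h as a linear function of the features: Q_h(x,a) = w_h^T φ(x,a).
Updates w_h by regularized least squares, regressing on the observed rewards plus the estimated next-step value.
Adds a UCB bonus β(φ^T Λ_h^{-1} φ)^{1/2} to encourage exploration, where Λ_h is the (λ-regularized) Gram matrix of the features.
Proves that, with suitable λ and β, total regret is Õ(√(d^3 H^3 T)) under Assumption A (linear MDP).
Shows robustness to ζ-approximate linear MDPs, at the cost of an additive regret term Õ(ζ d H T).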
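
The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lsvi_ucb`, the toy MDP, the one-hot feature map, the fixed initial state, and the value `beta=1.0` are all assumptions made for the example (the theory prescribes its own choice of β, which is not reproduced here). For convenience the sketch evaluates Q on every state-action pair of a tiny MDP; that enumeration is only practical because the toy state space is small.

```python
import numpy as np


def lsvi_ucb(phi, P, R, H, K, beta, lam=1.0, seed=0):
    """Sketch of optimistic LSVI with a UCB bonus on a small finite MDP.

    phi: (S, A, d) feature map, P: (H, S, A, S) transition probabilities,
    R: (H, S, A) rewards in [0, 1]. Returns the total reward of each episode.
    """
    rng = np.random.default_rng(seed)
    S, A, d = phi.shape
    # history[h] stores (state, action, reward, next_state) seen at step h
    history = [[] for _ in range(H)]
    returns = []
    for _ in range(K):
        # Backward pass: Q[H] stays zero (no value beyond the horizon)
        Q = np.zeros((H + 1, S, A))
        for h in reversed(range(H)):
            Lam = lam * np.eye(d)            # regularized Gram matrix
            target = np.zeros(d)
            for si, ai, ri, ni in history[h]:
                f = phi[si, ai]
                Lam += np.outer(f, f)
                target += f * (ri + Q[h + 1, ni].max())
            Lam_inv = np.linalg.inv(Lam)
            w = Lam_inv @ target             # regularized least-squares solution
            quad = np.einsum("sad,de,sae->sa", phi, Lam_inv, phi)
            bonus = beta * np.sqrt(np.maximum(quad, 0.0))
            Q[h] = np.minimum(phi @ w + bonus, H)  # optimistic, clipped at H
        # Forward pass: act greedily with respect to the optimistic Q
        s, total = 0, 0.0                    # fixed initial state (assumption)
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            r = R[h, s, a]
            s_next = int(rng.choice(S, p=P[h, s, a]))
            history[h].append((s, a, r, s_next))
            total += r
            s = s_next
        returns.append(total)
    return np.array(returns)


# Toy usage: one-hot features make any finite MDP a linear MDP with d = S * A.
S, A, H = 3, 2, 3
d = S * A
phi = np.eye(d).reshape(S, A, d)
env_rng = np.random.default_rng(1)
P = env_rng.dirichlet(np.ones(S), size=(H, S, A))  # shape (H, S, A, S)
R = env_rng.uniform(size=(H, S, A))
returns = lsvi_ucb(phi, P, R, H=H, K=200, beta=1.0)
```

The sketch rebuilds the Gram matrix from scratch each episode to keep the correspondence with the listed steps obvious; an incremental update would be cheaper.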

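Results

Research Questions

RQ1: Can an RL algorithm with function approximation achieve polynomial runtime and sample complexity without a simulator or restrictive assumptions?
RQ2: Is linear MDP structure sufficient to guarantee sublinear regret that is independent of the size of the state and action spaces?
RQ3: How does misspecification (a ζ-approximate linear MDP) affect the regret and learning guarantees?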

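Key Findings

The proposed LSVI-UCB algorithm achieves Õ(√(d^3 H^3 T)) regret with high probability, independent of S and A.
The algorithm runs in O(d^2 A K T) time and O(d^2 H + d A T) space; both are independent of the number of states S, though they scale with the number of actions A.
Under a ζ-approximate linear MDP, regret becomes Õ(√(d^3 H^3 T) + ζ d H T): misspecification contributes an additive term that grows linearly in T.
The results include a PAC-style guarantee: when the initial state is fixed, an ε-optimal policy can be learned from Õ(d^3 H^4 / ε^2) samples.
By achieving sublinear regret without a simulator, the method bridges tabular RL and RL with function approximation.
The analysis introduces a value-aware uniform concentration argument and uses the linear structure to relate the empirical transition measure to the true one.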
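
The d^2 factor in the runtime above is what one would expect if the inverse Gram matrix is maintained incrementally instead of being re-inverted at every step. The sketch below shows one standard way to do that, the Sherman-Morrison rank-one update; whether the stated runtime assumes exactly this technique is not stated in this review, and the helper name `sherman_morrison_update` is hypothetical.

```python
import numpy as np


def sherman_morrison_update(Lam_inv, f):
    """Return (Lam + f f^T)^{-1} given the symmetric inverse Lam^{-1}, in O(d^2) time."""
    u = Lam_inv @ f
    return Lam_inv - np.outer(u, u) / (1.0 + f @ u)


# Check the incremental inverse against a direct inversion on random features.
d = 4
rng = np.random.default_rng(0)
Lam = np.eye(d)       # lambda * I with lambda = 1
Lam_inv = np.eye(d)
for _ in range(10):
    f = rng.normal(size=d)
    Lam += np.outer(f, f)
    Lam_inv = sherman_morrison_update(Lam_inv, f)
max_err = np.abs(Lam_inv - np.linalg.inv(Lam)).max()
```

Each update costs O(d^2), compared with O(d^3) for a fresh inversion, so the saving grows with the feature dimension.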


This review was generated by AI and checked by a human editor.