QUICK REVIEW

[Paper Review] Bridging Exploration and General Function Approximation in Reinforcement Learning: Provably Efficient Kernel and Neural Value Iterations

Zhuoran Yang, Chi Jin | arXiv (Cornell University) | Nov 9, 2020

Advanced Bandit Algorithms Research | References: 33 | Citations: 18

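One-sentence summary

This paper proposes the first provably efficient reinforcement learning algorithms that use kernel and neural network function approximation. An optimistic modification of least-squares value iteration, which builds exploration into the value updates, achieves $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret. The approach requires no additional assumptions on the data-generating model, guarantees polynomial runtime and sample complexity, and scales to large or infinite state spaces.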

ABSTRACT

Reinforcement learning (RL) algorithms combined with modern function approximators such as kernel functions and deep neural networks have achieved significant empirical successes in large-scale application problems with a massive number of states. From a theoretical perspective, however, RL with functional approximation poses a fundamental challenge to developing algorithms with provable computational and statistical efficiency, due to the need to take into consideration both the exploration-exploitation tradeoff that is inherent in RL and the bias-variance tradeoff that is innate in statistical estimation. To address such a challenge, focusing on the episodic setting where the action-value functions are represented by a kernel function or over-parametrized neural network, we propose the first provable RL algorithm with both polynomial runtime and sample complexity, without additional assumptions on the data-generating model. In particular, for both the kernel and neural settings, we prove that an optimistic modification of the least-squares value iteration algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states and therefore even allow it to diverge, which exhibits the benefit of function approximation.
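
For readers who want the quantity behind the bound spelled out, the regret in an episodic setting is usually defined as the cumulative gap between the optimal value and the value of the policy actually executed in each episode. The notation below is assumed for illustration and does not come from this review: $x_1^t$ is the initial state of episode $t$, $\pi^t$ is the policy used in that episode, and $V_1^{\star}$ and $V_1^{\pi^t}$ are the optimal and executed value functions at the first step.

```latex
\mathrm{Regret}(T)
  = \sum_{t=1}^{T} \left[ V_1^{\star}(x_1^t) - V_1^{\pi^t}(x_1^t) \right],
\qquad
\mathrm{Regret}(T)
  = \tilde{\mathcal{O}}\!\left( \delta_{\mathcal{F}} H^2 \sqrt{T} \right).
```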

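Motivation and objectives

Address the theoretical challenge of handling the exploration-exploitation tradeoff and the bias-variance tradeoff simultaneously in RL with function approximation.
Develop provably efficient RL algorithms for large or infinite state spaces using kernel and neural network function approximators.
Achieve polynomial runtime and sample complexity without additional assumptions on the data-generating model.
Establish regret bounds that do not depend on the number of states, so that the guarantees extend to high-dimensional or continuous environments.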

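Proposed method

Propose an optimistic modification of the least-squares value iteration algorithm to balance exploration and exploitation.
Represent action-value functions within a function class $\mathcal{F}$ using kernel functions or over-parameterized neural networks.
Incorporate uncertainty estimates into the value updates to encourage exploration of unexplored state-action pairs.
Use the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class $\mathcal{F}$ to bound the regret as a function of $\delta_{\mathcal{F}}$, $H$, and $T$.
Apply statistical learning theory to control the estimation error and guarantee generalization under function approximation.
Derive a regret bound that scales as $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$, independent of the number of states.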
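
The idea of adding an uncertainty estimate to a least-squares value estimate can be sketched for the kernel case as follows. This is a minimal illustration and not the authors' implementation: it fits a kernel ridge regression to regression targets, adds a bonus proportional to the posterior-style standard deviation, and clips the result. The RBF kernel, the function names `rbf_kernel` and `optimistic_q`, and the parameters `beta`, `lam`, `lengthscale`, and `q_max` are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A (n, d) and B (m, d)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale ** 2))


def optimistic_q(Z_train, y_train, Z_query, beta=1.0, lam=1.0, q_max=1.0):
    """Kernel ridge estimate plus an exploration bonus (hypothetical helper).

    Z_train: (n, d) observed state-action features.
    y_train: (n,) regression targets, e.g. reward plus next-step value.
    Z_query: (m, d) state-action features to evaluate.
    Returns the clipped optimistic values and the raw bonus, both of shape (m,).
    """
    gram = rbf_kernel(Z_train, Z_train) + lam * np.eye(len(Z_train))
    k_query = rbf_kernel(Z_train, Z_query)           # shape (n, m)

    # Least-squares (kernel ridge) prediction.
    mean = k_query.T @ np.linalg.solve(gram, y_train)

    # Uncertainty: k(z, z) - k(z)^T (K + lam I)^{-1} k(z); k(z, z) = 1 for RBF.
    solved = np.linalg.solve(gram, k_query)          # shape (n, m)
    variance = 1.0 - np.sum(k_query * solved, axis=0)
    bonus = np.sqrt(np.maximum(variance, 0.0) / lam)

    # Optimism: add the scaled bonus, then clip to the valid value range.
    return np.clip(mean + beta * bonus, 0.0, q_max), bonus
```

In this sketch the bonus is small for query points close to the observed data and approaches its maximum far away from it, so a greedy policy with respect to the optimistic values is drawn toward poorly explored state-action pairs.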

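Results

Research questions

RQ1: Can provably efficient RL algorithms that balance exploration and generalization be designed using kernel and neural network function approximation?
RQ2: In episodic RL with function approximation, what regret bound independent of the size of the state space is attainable?
RQ3: How can optimism be incorporated into value iteration so as to obtain both computational efficiency and statistical efficiency?
RQ4: How does the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class contribute to the regret in RL with function approximation?
RQ5: Is it possible to achieve polynomial runtime and sample complexity without imposing restrictive assumptions on the data-generating process?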

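Key findings

The proposed algorithms achieve a regret bound of $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ that is independent of the number of states.
The regret bound scales linearly with the intrinsic complexity $\delta_{\mathcal{F}}$ of the function class $\mathcal{F}$, capturing the tradeoff between approximation error and estimation error.
The algorithms retain polynomial runtime and sample complexity even when the number of states is very large or infinite.
The optimistic least-squares value iteration framework successfully balances exploration and exploitation in RL with function approximation.
The theoretical analysis requires no additional assumptions on the data-generating model, which broadens the generality of the method.
The results show that function approximation can be used in RL in a provably efficient way, particularly in high-dimensional or continuous environments.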
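
To make the scaling of the bound concrete, the short sketch below evaluates $\delta_{\mathcal{F}} H^2 \sqrt{T}$ with constants and logarithmic factors dropped. The function names and the input values are hypothetical and chosen only to show the trend; they are not results from the paper. Note that the number of states does not appear anywhere in the computation.

```python
import math


def regret_bound(delta_f, horizon, episodes):
    """Leading-order term delta_F * H^2 * sqrt(T), ignoring constants and logs."""
    return delta_f * horizon ** 2 * math.sqrt(episodes)


def average_regret_bound(delta_f, horizon, episodes):
    """Per-episode version of the bound; it shrinks like 1 / sqrt(T)."""
    return regret_bound(delta_f, horizon, episodes) / episodes
```

Under these assumptions, quadrupling the number of episodes doubles the cumulative bound and halves the per-episode bound, which is the sense in which a $\sqrt{T}$ regret implies that the executed policies approach optimal performance on average.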

This review was generated by AI and checked by a human editor.