QUICK REVIEW

[Paper Review] What are the Statistical Limits of Offline RL with Linear Function Approximation?

Ruosong Wang, Dean P. Foster|arXiv (Cornell University)|Oct 22, 2020

Reinforcement Learning in Robotics | References: 63 | Citations: 37

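One-line summary

This paper proves that, even assuming realizability (linear Q-functions) and good coverage of bounded features, offline RL needs a number of samples exponential in the horizon to evaluate any given policy. It also analyzes why sample-efficient offline policy evaluation is impossible without stronger conditions.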

ABSTRACT

Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, where the literature largely consists of sufficient conditions. This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: i) we have realizability in that the true value function of every policy is linear in a given set of features and ii) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of any given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).

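Motivation and objectives

Assess whether realizability and good feature coverage are sufficient for sample-efficient offline RL with linear function approximation.
Establish the fundamental limits of offline policy evaluation in the linear setting.
Illustrate error amplification and identify the conditions under which efficiency becomes possible.
Provide insight into how LSPE behaves under offline data and realizability.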

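Proposed approach

State and formalize a hardness theorem showing that, under Assumption 1 and Assumption 2, the sample complexity is exponential in the horizon H.
Construct hard MDP instances with linear Q-functions and bounded feature norms to prove the information-theoretic lower bound.
Illustrate geometric error amplification under offline data through an analysis of LSPE.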
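To make the object of that analysis concrete, here is a minimal sketch of least-squares policy evaluation (LSPE) for a finite-horizon problem. It is a generic textbook-style version, not the paper's exact pseudocode. The function name `lspe`, the argument layout (`features`, `rewards`, `next_features`), and the small `ridge` term added for numerical stability are all assumptions made for illustration.

```python
import numpy as np

def lspe(features, rewards, next_features, horizon, ridge=1e-6):
    """Backward least-squares policy evaluation (illustrative sketch).

    features[h]      : (n, d) array of phi(s, a) sampled at step h
    rewards[h]       : (n,) array of observed rewards at step h
    next_features[h] : (n, d) array of phi(s', pi(s')) for the target policy
    Returns weight vectors theta[h] with Q_h(s, a) ~ phi(s, a) @ theta[h].
    """
    d = features[0].shape[1]
    # theta[horizon] stays zero: there is no value after the last step.
    theta = [np.zeros(d) for _ in range(horizon + 1)]
    for h in reversed(range(horizon)):
        # Regression target: reward plus the estimated next-step value.
        target = rewards[h] + next_features[h] @ theta[h + 1]
        # Ridge-regularized normal equations (ridge is a hypothetical choice).
        gram = features[h].T @ features[h] + ridge * np.eye(d)
        theta[h] = np.linalg.solve(gram, features[h].T @ target)
    return theta[:horizon]
```

Each backward step regresses onto a target that already contains the previous step's estimation error, which is the route by which errors can compound across the horizon.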

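Results

Research questions

RQ1: Do realizability and good feature coverage guarantee polynomial sample complexity for offline policy evaluation with linear function approximation?
RQ2: Can stronger distributional or representational conditions make offline RL sample-efficient?
RQ3: Under the realizability assumption, how does error propagate (amplify) in offline policy evaluation with LSPE?
RQ4: Which concrete hard instances demonstrate the limits of offline RL with linear function approximation?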
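For reference, the two conditions that RQ1 asks about can be written roughly as follows. This is a sketch that assumes a finite-horizon MDP with horizon $H$, a feature map $\phi(s,a) \in \mathbb{R}^d$, per-step offline data distributions $\mu_h$, and weight vectors $\theta_h^{\pi}$. These symbols and the $1/d$ threshold are notational assumptions introduced here, so the paper itself should be consulted for the exact statements.

```latex
\begin{align*}
% Realizability: the Q-function of every policy is linear in the features.
&\forall \pi,\ \forall h \in \{1,\dots,H\}: \quad
  Q_h^{\pi}(s,a) = \phi(s,a)^{\top}\theta_h^{\pi} \\
% Coverage: bounded features and a well-conditioned feature covariance
% under the offline data distribution (the spectral condition).
&\|\phi(s,a)\|_2 \le 1, \qquad
  \sigma_{\min}\!\Big(\mathbb{E}_{(s,a)\sim\mu_h}\big[\phi(s,a)\,\phi(s,a)^{\top}\big]\Big) \ge \frac{1}{d}
\end{align*}
```

Because the features have norm at most 1, the trace of the covariance is at most 1, so a smallest eigenvalue of order $1/d$ is the best coverage one can ask for in this form.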

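Key findings

Under the above assumptions, any algorithm requires a number of samples exponential in the horizon to non-trivially estimate the value of any given policy.
Although it is an unbiased estimator by construction, LSPE has variance that is exponential in the horizon H.
In the offline, linear, realizable setting, there is an exponential separation in sample complexity between offline LSVI and LSPI.
Two hard instances (one with sparse rewards, one with deterministic dynamics) show that both reward-estimation error and transition-estimation error can be amplified.
Sample-efficient offline policy evaluation is possible only under stronger conditions, such as low distribution shift or representational assumptions stronger than realizability.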
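The link between geometric error amplification and exponential sample requirements can be illustrated with a toy calculation. This is not the paper's construction. It assumes a hypothetical model in which statistical noise of size `noise_std / sqrt(n)` is multiplied by a constant factor `amplification` at each backward step, and all names and numbers below are made up for illustration.

```python
import math

def propagated_error(per_step_error, amplification, horizon):
    """Error at the first step if an error injected at the last step
    grows by a constant factor per backward step (toy model)."""
    return per_step_error * amplification ** (horizon - 1)

def samples_needed(noise_std, amplification, horizon, tolerance):
    """Smallest n such that noise_std / sqrt(n), amplified over the
    horizon, stays within the tolerance (toy model)."""
    ratio = noise_std * amplification ** (horizon - 1) / tolerance
    return math.ceil(ratio ** 2)

# Hypothetical settings: unit noise, amplification factor 2, tolerance 0.5.
for H in (2, 5, 10):
    n = samples_needed(noise_std=1.0, amplification=2.0, horizon=H, tolerance=0.5)
    print(f"H={H}: n={n}")
```

In this toy model the required sample count scales as `amplification ** (2 * (horizon - 1))`, so any per-step factor above 1 gives exponential growth in the horizon.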


This review was generated by AI and checked by a human editor.