QUICK REVIEW

[Paper Review] A Survey on Model-based Reinforcement Learning

Fan-Ming Luo, Xu Tian|arXiv (Cornell University)|Jun 19, 2022

Reinforcement Learning in Robotics|Cited by 26

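One-line summary

This paper surveys model-based reinforcement learning (MBRL), focusing on how environment models are learned and used in deep RL. It analyzes the discrepancy between policy training in the learned model and in the real environment, and covers progress in related RL paradigms and real-world applications.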

ABSTRACT

Reinforcement learning (RL) solves sequential decision-making problems via a trial-and-error process interacting with the environment. While RL achieves outstanding success in playing complex video games that allow huge trial-and-error, making errors is always undesired in the real world. To improve the sample efficiency and thus reduce the errors, model-based reinforcement learning (MBRL) is believed to be a promising direction, which builds environment models in which the trial-and-errors can take place without real costs. In this survey, we take a review of MBRL with a focus on the recent progress in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. As such, it is of great importance to analyze the discrepancy between policy training in the environment model and that in the real environment, which in turn guides the algorithm design for better model learning, model usage, and policy training. Besides, we also discuss the recent advances of model-based techniques in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Moreover, we discuss the applicability and advantages of MBRL in real-world tasks. Finally, we end this survey by discussing the promising prospects for the future development of MBRL. We think that MBRL has great potential and advantages in real-world applications that were overlooked, and we hope this survey could attract more research on MBRL.

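Motivation and objectives

Explain why MBRL can improve sample efficiency over model-free methods in deep RL.
Review classical and modern methods for learning environment models (tabular and function approximation).
Discuss how models are used (planning, rollouts, integration with other forms of RL) and analyze the resulting policy/value discrepancy.
Summarize recent progress in model-based methods for offline, goal-conditioned, multi-agent, and meta-RL.
Highlight the real-world applicability of MBRL and future directions.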

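Methods

Describes tabular and neural-network-based model learning for MDPs, including learning the transition model M and the reward function R, and likelihood-based objectives.
Discusses the one-step prediction loss and probabilistic modeling, which captures aleatoric uncertainty.
Presents simulation lemmas (Theorem 1 and Theorem 2) that bound the value evaluation error under model error.
Introduces distribution matching (Jensen-Shannon divergence, Wasserstein distance) to mitigate long-horizon effects (Simulation Lemma III).
Explores the policy-distribution perspective and CVaR-based robust learning for outlier policies.
Surveys multi-step and backward model variants, and representation learning for complex environments.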
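
To make the tabular case of learning M and R concrete, here is a minimal sketch of maximum-likelihood model learning by counting, written for this review rather than taken from the survey. The function name `fit_tabular_model`, the `(s, a, r, s_next)` tuple layout, and the fallback for unvisited state-action pairs (uniform next-state distribution, zero reward) are all assumptions made for illustration.

```python
import numpy as np

def fit_tabular_model(transitions, n_states, n_actions):
    """Estimate M(s'|s,a) and R(s,a) from (s, a, r, s_next) tuples by counting.

    Hypothetical helper: for visited (s, a) pairs this is the empirical
    (maximum-likelihood) transition frequency and the mean observed reward.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    visits = counts.sum(axis=2)  # number of samples per (s, a)
    seen = visits > 0

    # Assumption: unvisited pairs get a uniform next-state distribution
    # and zero reward, since the data says nothing about them.
    M_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    R_hat = np.zeros((n_states, n_actions))
    M_hat[seen] = counts[seen] / visits[seen][:, None]
    R_hat[seen] = reward_sum[seen] / visits[seen]
    return M_hat, R_hat

if __name__ == "__main__":
    data = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 1, 2.0, 1)]
    M_hat, R_hat = fit_tabular_model(data, n_states=2, n_actions=2)
    print(M_hat[0, 0], R_hat[0, 0])
```

With function approximation the same idea is typically replaced by minimizing a one-step prediction loss or a negative log-likelihood, but the counting estimator is the easiest place to see what "learning M and R" means.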

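Results

Research questions

RQ1: How does model approximation affect policy/value performance when training in the learned MDP versus in the real environment?
RQ2: What theoretical bounds hold for the value evaluation error, given the model error and the reward error?
RQ3: How do distribution matching and adversarial methods affect the quality of the learned transition model?
RQ4: What strategies effectively reduce compounding error in long rollouts and in partially observable or high-dimensional tasks?
RQ5: How can MBRL be integrated with offline, goal-conditioned, multi-agent, and meta-RL frameworks?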

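Key findings

Under certain conditions, model error propagates into value error with horizon-dependent (often quadratic) growth.
Probabilistic learned models capture aleatoric uncertainty and improve robustness compared with deterministic one-step predictors.
Simulation lemmas bound the policy evaluation error, linking model error to performance loss; short rollouts mitigate compounding error.
Distribution matching (JS/Wasserstein) improves long-horizon behavior and can reduce sample complexity in some settings.
Models constrained to be Lipschitz continuous bound the multi-step prediction error and can limit compounding effects.
Dreamer and related latent dynamics models achieve strong performance on vision-based tasks using world models and planning in latent space.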
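
The quadratic horizon dependence can be checked numerically on a small tabular MDP. The sketch below uses one standard textbook form of the simulation lemma, which is not necessarily the exact statement of the survey's Theorem 1 or 2: if the reward error is at most eps_r and the L1 distance between true and learned next-state distributions is at most eps_m for every state-action pair, then the value gap for any fixed policy is at most eps_r/(1-gamma) + gamma * eps_m * max|R_hat| / (1-gamma)^2. The random MDP, the noise levels, and the names `policy_value`, `simulation_bound`, and `demo` are assumptions made for illustration.

```python
import numpy as np

def policy_value(M, R, pi, gamma):
    """Exact V^pi of a tabular MDP, solving (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum("sa,sat->st", pi, M)  # state-to-state matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(M.shape[0]) - gamma * P_pi, r_pi)

def simulation_bound(M, R, M_hat, R_hat, gamma):
    """Textbook simulation-lemma bound on max_s |V^pi_M(s) - V^pi_M_hat(s)|."""
    eps_r = np.abs(R - R_hat).max()              # worst-case reward error
    eps_m = np.abs(M - M_hat).sum(axis=2).max()  # worst-case L1 transition error
    v_max = np.abs(R_hat).max() / (1.0 - gamma)  # bound on values in the learned model
    return eps_r / (1.0 - gamma) + gamma * eps_m * v_max / (1.0 - gamma)

def demo(seed=0, n_states=5, n_actions=2, gamma=0.9):
    """Compare the actual value gap with the bound on a random MDP (assumed setup)."""
    rng = np.random.default_rng(seed)
    M = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    # Perturbed "learned" model: mix in 5% of another random transition kernel.
    noise = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    M_hat = 0.95 * M + 0.05 * noise
    R_hat = np.clip(R + rng.uniform(-0.02, 0.02, size=R.shape), 0.0, 1.0)
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform random policy
    gap = np.abs(policy_value(M, R, pi, gamma)
                 - policy_value(M_hat, R_hat, pi, gamma)).max()
    return gap, simulation_bound(M, R, M_hat, R_hat, gamma)

if __name__ == "__main__":
    gap, bound = demo()
    print(f"actual gap = {gap:.4f}, bound = {bound:.4f}")
```

The bound is usually loose in practice, but it shows why the effective horizon 1/(1-gamma) enters squared in the transition-error term, and why shortening model rollouts limits how much error can accumulate.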


This review was generated by AI and checked by a human editor.