QUICK REVIEW

[Paper Review] MOReL: Model-Based Offline Reinforcement Learning

Rahul Kidambi, Aravind Rajeswaran | arXiv (Cornell University) | May 12, 2020

Reinforcement Learning in Robotics | References: 86 | Citations: 158

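ONE-LINE SUMMARY

MOReL introduces a model-based offline RL framework that builds a pessimistic MDP from offline data and learns a near-optimal policy in it, with theoretical minimax-optimality guarantees and state-of-the-art results on offline RL benchmarks.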

ABSTRACT

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline can greatly expand the applicability of RL, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; and (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g. generative modeling, uncertainty estimation, planning etc.) to directly translate into advances for offline RL.

RESEARCH MOTIVATION AND OBJECTIVES

Motivate offline RL and address data-efficiency and safety when learning from static datasets.
Propose a model-based offline RL framework that mitigates model exploitation via pessimism.
Provide theoretical guarantees showing near-minimax optimality for MOReL in offline RL.
Demonstrate empirical SOTA performance on established offline RL benchmarks and D4RL.

PROPOSED METHOD

Learn an approximate dynamics model P̂ from the offline dataset.
Introduce an unknown state-action detector (USAD) to partition known vs unknown regions based on model accuracy via total variation distance.
Construct a pessimistic MDP with an absorbing HALT state that heavily penalizes unknown regions (−κ) and routes unknowns to HALT.
Run a planner (PLANNER) in the pessimistic MDP to obtain a policy that is ε_π-suboptimal in the P-MDP.
Optionally estimate the behavior policy from data and incorporate model ensembles to quantify uncertainty for USAD.
Provide theoretical guarantees that bound the policy value gap between the true MDP and the P-MDP and that establish near-minimax optimality.
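
The construction described in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the USAD is approximated by ensemble disagreement (largest pairwise distance between predictions), the penalty −κ is charged on the step that enters HALT as well as inside HALT, and the first ensemble member stands in for the learned model P̂. The names `disagreement`, `is_unknown`, `pmdp_step`, `make_model`, and `reward_fn`, along with the threshold and κ values, are hypothetical.

```python
import math

HALT = "HALT"  # absorbing state of the pessimistic MDP


def disagreement(models, state, action):
    """Largest pairwise Euclidean distance between ensemble predictions."""
    preds = [m(state, action) for m in models]
    return max(
        (math.dist(p, q) for i, p in enumerate(preds) for q in preds[i + 1:]),
        default=0.0,
    )


def is_unknown(models, state, action, threshold):
    """USAD stand-in (assumption): unknown when the ensemble disagrees too much."""
    return disagreement(models, state, action) > threshold


def pmdp_step(models, reward_fn, state, action, threshold, kappa):
    """One transition in the pessimistic MDP; returns (next_state, reward)."""
    # HALT is absorbing, and unknown state-action pairs are routed to it.
    if state == HALT or is_unknown(models, state, action, threshold):
        return HALT, -kappa
    # Known region: follow the learned model (here, the first ensemble member).
    return models[0](state, action), reward_fn(state, action)


def make_model(curvature):
    """Toy 1-D dynamics; members agree near the origin and diverge away from it."""
    def model(state, action):
        (x,) = state
        (a,) = action
        return (x + a + curvature * x * x,)
    return model


def reward_fn(state, action):
    """Toy reward: stay close to the origin."""
    return -abs(state[0])


models = [make_model(c) for c in (0.0, 0.01, -0.01)]

near = pmdp_step(models, reward_fn, (1.0,), (0.5,), threshold=0.05, kappa=100.0)
far = pmdp_step(models, reward_fn, (3.0,), (0.5,), threshold=0.05, kappa=100.0)
print(near, far)
```

In the toy setup the ensemble members agree closely at x = 1.0, so the step follows the model, while at x = 3.0 their predictions spread past the threshold and the step is routed to HALT with the penalty. Any planner can then be run against `pmdp_step` in place of the real environment.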

EXPERIMENTAL RESULTS

Research Questions

RQ1: How does MOReL perform relative to prior offline RL methods in standard benchmarks?
RQ2: Can a model-based offline RL framework with pessimism provide strong theoretical guarantees and practical stability against model exploitation?
RQ3: How does the quality and coverage of the offline data affect the learned policy in MOReL?
RQ4: Does learning progress in the P-MDP translate effectively to progress in the true environment?

Key Findings

MOReL achieves state-of-the-art results in 12 of 20 environment-dataset configurations and is competitive in the remaining configurations.
MOReL attains strong results in the D4RL benchmark, often surpassing or closely matching top methods across domains.
The P-MDP regularization via unknown-region penalties yields more stable and monotonic learning curves than naive model-based RL.
Theoretical bounds show the policy value in the P-MDP closely tracks the true MDP up to terms depending on starting-state distribution mismatch, model error α, and unknown-state hitting time.
Empirical results show that the quality of the data logging policy significantly influences MOReL’s performance; better logging policies lead to higher achievable policy values.
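
The theoretical finding above can be written as a two-sided bound. The form below is a sketch of the paper's policy-value theorem as best it can be reconstructed here; the notation is an assumption of this review (J for expected discounted return, ρ0 and ρ̂0 for the true and estimated start-state distributions, M for the true MDP, M̂_p for the P-MDP, T_U^π for the first time π reaches an unknown state-action pair, R_max for the reward bound, γ for the discount factor), and the constants should be checked against the original statement.

```latex
\begin{align*}
J_{\hat{\rho}_0}(\pi, \hat{M}_p)
  &\ge J_{\rho_0}(\pi, M)
   - \frac{2 R_{\max}}{1-\gamma}\, D_{TV}(\rho_0, \hat{\rho}_0)
   - \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \alpha
   - \frac{2 R_{\max}}{1-\gamma}\, \mathbb{E}\!\left[\gamma^{T_U^{\pi}}\right], \\
J_{\hat{\rho}_0}(\pi, \hat{M}_p)
  &\le J_{\rho_0}(\pi, M)
   + \frac{2 R_{\max}}{1-\gamma}\, D_{TV}(\rho_0, \hat{\rho}_0)
   + \frac{2 \gamma R_{\max}}{(1-\gamma)^2}\, \alpha .
\end{align*}
```

The three penalty terms correspond one-to-one to the quantities named in the finding: start-state distribution mismatch, model error α, and the hitting time of the unknown region. The last term shrinks when the policy takes a long time to leave the region covered by the data, since γ raised to a large hitting time is small.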

This review was generated by AI and checked by a human editor.