QUICK REVIEW

[Paper Review] MOPO: Model-based Offline Policy Optimization

Tianhe Yu, Garrett Thomas|arXiv (Cornell University)|May 27, 2020

Reinforcement Learning in Robotics|References: 71|Cited by: 217

One-Sentence Summary

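MOPO introduces a model-based offline RL method that penalizes rewards by the estimated uncertainty of the dynamics model, allowing it to generalize safely beyond the offline data distribution; it outperforms prior model-free and model-based methods on D4RL and on out-of-distribution tasks.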

ABSTRACT

Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data. This problem setting offers the promise of utilizing such datasets to acquire policies without any costly or dangerous active exploration. However, it is also challenging, due to the distributional shift between the offline training data and those states visited by the learned policy. Despite significant recent progress, the most successful prior methods are model-free and constrain the policy to the support of data, precluding generalization to unseen states. In this paper, we first observe that an existing model-based RL algorithm already produces significant gains in the offline setting compared to model-free approaches. However, standard model-based RL methods, designed for the online setting, do not provide an explicit mechanism to avoid the offline setting's distributional shift issue. Instead, we propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics. We theoretically show that the algorithm maximizes a lower bound of the policy's return under the true MDP. We also characterize the trade-off between the gain and risk of leaving the support of the batch data. Our algorithm, Model-based Offline Policy Optimization (MOPO), outperforms standard model-based RL algorithms and prior state-of-the-art model-free offline RL algorithms on existing offline RL benchmarks and two challenging continuous control tasks that require generalizing from data collected for a different task. The code is available at https://github.com/tianheyu927/mopo.

Motivation and Objectives

Motivate offline RL that can generalize beyond the data support and beyond the target task.
Develop a model-based offline RL algorithm that manages distributional shift via uncertainty penalties.
Provide theoretical guarantees that MOPO maximizes a lower bound on the true return.
Propose a practical MOPO implementation with ensemble-based uncertainty to penalize rewards.
Evaluate MOPO on standard offline RL benchmarks and tasks requiring out-of-distribution generalization.

Proposed Method

Build on MBPO by incorporating an uncertainty-based reward penalty derived from model error estimates.
Define an uncertainty-penalized reward: $\tilde{r}(s,a) = r(s,a) - \lambda\, u(s,a)$.
Estimate dynamics with an ensemble of probabilistic models and use the largest predicted standard deviation across ensemble members as the uncertainty estimate u(s,a).
Train a policy on the uncertainty-penalized MDP to maximize the conservative return.
Provide a theoretical bound: $\eta_M(\hat{\pi}) \ge \max_{\pi} \{ \eta_M(\pi) - 2\lambda\,\epsilon_u(\pi) \}$.
Offer practical guidelines for implementing MOPO, including how lambda relates to the error estimator and how it is computed.
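
The reward penalty described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes each ensemble member outputs a per-dimension standard deviation for the next state, and it takes u(s,a) to be the largest L2 norm of that vector across members. The function names, array shapes, and the toy numbers are hypothetical.

```python
import numpy as np


def ensemble_uncertainty(member_stds: np.ndarray) -> np.ndarray:
    """Return u(s, a) for a batch of state-action pairs.

    member_stds has shape (n_members, batch, state_dim): the per-dimension
    standard deviation each ensemble member predicts for the next state.
    Assumption: u is the largest L2 norm of that vector across members.
    """
    per_member_norm = np.linalg.norm(member_stds, axis=-1)  # (n_members, batch)
    return per_member_norm.max(axis=0)                      # (batch,)


def penalized_reward(reward: np.ndarray, member_stds: np.ndarray, lam: float) -> np.ndarray:
    """Compute r~(s, a) = r(s, a) - lam * u(s, a) elementwise over the batch."""
    return reward - lam * ensemble_uncertainty(member_stds)


# Toy batch: 2 ensemble members, 2 state-action pairs, 2 state dimensions.
member_stds = np.array([
    [[3.0, 4.0], [0.0, 1.0]],   # member 0: norms 5 and 1
    [[0.0, 2.0], [6.0, 8.0]],   # member 1: norms 2 and 10
])
reward = np.array([1.0, 1.0])

# u = [5, 10], so with lam = 0.1 the penalized rewards are 0.5 and 0.0.
r_tilde = penalized_reward(reward, member_stds, lam=0.1)
print(r_tilde)
```

In the toy batch, the second state-action pair loses its entire reward because one ensemble member is highly uncertain there. A policy trained on these penalized rewards, for example with MBPO-style model rollouts, is therefore steered away from regions where the model is unreliable, and lam controls how strongly.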

Experimental Results

Research Questions

RQ1: Can offline model-based RL generalize beyond the data support better than model-free offline methods?
RQ2: How should uncertainty about dynamics be quantified and incorporated into the reward to balance risk and return?
RQ3: Does MOPO outperform existing model-free offline methods on standard benchmarks and in out-of-distribution tasks?
RQ4: What theoretical guarantees can be provided for MOPO’s performance relative to the true MDP?

Key Findings

Dataset type	Environment	BC	MOPO (ours)	MBPO	SAC	BEAR	BRAC-v
random	halfcheetah	2.1	35.4 ± 2.5	30.7 ± 3.9	30.5	25.5	28.1
random	hopper	1.6	11.7 ± 0.4	4.5 ± 6.0	11.3	9.5	12.0
medium	halfcheetah	36.1	42.3 ± 1.6	28.3 ± 22.7	-4.3	38.6	45.5
medium	hopper	29.0	28.0 ± 12.4	4.9 ± 3.3	0.8	47.6	32.3
medium	walker2d	6.6	17.8 ± 19.3	12.7 ± 7.6	0.9	33.2	81.3
mixed	halfcheetah	38.4	53.1 ± 2.0	47.3 ± 12.6	-2.4	36.2	45.9
mixed	hopper	11.8	67.5 ± 24.7	49.8 ± 30.4	1.9	10.8	0.9
mixed	walker2d	11.3	39.0 ± 9.6	22.2 ± 12.7	3.5	25.3	0.8
med-expert	halfcheetah	35.8	63.3 ± 38.0	9.7 ± 9.5	1.8	51.7	45.3
med-expert	hopper	111.9	23.7 ± 6.0	56.0 ± 34.5	1.6	4.0	0.8
med-expert	walker2d	6.4	44.6 ± 12.9	7.6 ± 3.7	-0.1	26.0	66.6

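On the D4RL benchmark, MOPO attains the highest score in the table on all three mixed datasets and on halfcheetah med-expert, and it improves on MBPO and SAC in most rows; BEAR, BRAC-v, or BC score higher on the medium datasets and on hopper and walker2d med-expert.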
On the two continuous control tasks that require generalizing from data collected for a different task, MOPO outperforms the baselines and reaches states that do not appear in the offline data.
Two main results: (i) MOPO’s uncertainty-penalized framework yields conservative yet effective policy optimization; (ii) vanilla MBPO can outperform SAC in offline settings, supporting model-based approaches for batch RL.
An explicit trade-off between potential gain and risk is characterized, with a bound relating learned policy performance to model error along trajectories.
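
The gain-versus-risk trade-off can be read off the lower bound quoted under the proposed method. The following is a sketch under stated assumptions: $\epsilon_u(\pi)$ denotes the expected uncertainty penalty $u(s,a)$ that policy $\pi$ accumulates under the learned model, $\lambda \ge 0$, and $\pi^{\delta}$ is notation introduced here for any policy whose accumulated uncertainty is at most $\delta$.

```latex
\begin{align*}
\eta_M(\hat{\pi})
  &\ge \max_{\pi}\,\bigl\{\eta_M(\pi) - 2\lambda\,\epsilon_u(\pi)\bigr\} \\
  &\ge \eta_M(\pi^{\delta}) - 2\lambda\,\epsilon_u(\pi^{\delta}) \\
  &\ge \eta_M(\pi^{\delta}) - 2\lambda\,\delta
  \qquad \text{for any } \pi^{\delta} \text{ with } \epsilon_u(\pi^{\delta}) \le \delta .
\end{align*}
```

A small $\delta$ compares the learned policy only against policies that stay where the model is confident, so the risk term $2\lambda\delta$ is small but the attainable return $\eta_M(\pi^{\delta})$ may be limited. A larger $\delta$ admits policies that leave the data support and may earn more, at the cost of a larger subtracted penalty.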

This review was generated by AI and reviewed by a human editor.