QUICK REVIEW

[Paper Review] When to Trust Your Model: Model-Based Policy Optimization

Michael Janner, Justin Fu | arXiv (Cornell University) | Jun 19, 2019

Reinforcement Learning in Robotics | References: 44 | Cited by: 119

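ONE-SENTENCE SUMMARY

MBPO learns quickly by using short model rollouts branched from real data: it improves data efficiency while matching the asymptotic performance of model-free methods and avoiding the pitfalls of long-horizon model rollouts.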

ABSTRACT

Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.

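MOTIVATION AND OBJECTIVES

Motivate and analyze how predictive models are best used for policy optimization in reinforcement learning.
Provide a monotonic improvement guarantee for model-based updates that accounts for model error and distribution shift.
Introduce a practical, empirically motivated method (MBPO) that uses short, branched model rollouts to improve data efficiency.
Show that carefully controlled model usage can outperform prior model-based methods while retaining strong asymptotic performance.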

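PROPOSED METHOD

Propose a monotonic model-based policy improvement framework with generalization error ε_m and distribution shift error ε_π, and derive a bound relating true returns to model returns.
Introduce branched rollouts that start from states in the data-collecting policy's distribution and run for k steps under the learned model, limiting error accumulation.
Propose MBPO: train an ensemble of probabilistic dynamics models, optimize the policy with SAC, and generate short model rollouts from replay buffer states.
Use many short rollouts to create a large volume of model-generated data while mitigating model exploitation and the coupling between model horizon and task horizon.
In practice, measure model generalization empirically and adjust rollout usage to balance model-based and model-free updates.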
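
The branched-rollout step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy linear-Gaussian ensemble, the random placeholder policy (MBPO itself uses SAC), the quadratic reward, and all function names (`make_toy_ensemble`, `toy_policy`, `branched_rollouts`) are hypothetical stand-ins. Choosing a random ensemble member per transition is also an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_ensemble(n_models, state_dim, action_dim):
    """Hypothetical stand-in for a learned ensemble: random linear models."""
    members = []
    for _ in range(n_models):
        A = np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim))
        B = 0.1 * rng.standard_normal((state_dim, action_dim))
        members.append((A, B))
    return members

def toy_policy(states, action_dim):
    """Placeholder policy (assumption): uniform random actions in [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=(states.shape[0], action_dim))

def branched_rollouts(real_states, ensemble, k, batch_size, action_dim, noise_std=0.01):
    """Branch from states sampled out of real data, then simulate k model steps."""
    idx = rng.integers(0, len(real_states), size=batch_size)
    states = real_states[idx]
    model_buffer = []
    for _ in range(k):
        actions = toy_policy(states, action_dim)
        # Assumption: pick one ensemble member at random for each transition.
        member_ids = rng.integers(0, len(ensemble), size=batch_size)
        next_states = np.empty_like(states)
        for i, m in enumerate(member_ids):
            A, B = ensemble[m]
            mean = A @ states[i] + B @ actions[i]
            next_states[i] = mean + noise_std * rng.standard_normal(mean.shape)
        rewards = -np.sum(next_states ** 2, axis=1)  # toy reward (assumption)
        model_buffer.extend(zip(states, actions, rewards, next_states))
        states = next_states
    return model_buffer

real_states = rng.standard_normal((1000, 3))   # stands in for replay buffer states
ensemble = make_toy_ensemble(n_models=5, state_dim=3, action_dim=2)
model_data = branched_rollouts(real_states, ensemble, k=2, batch_size=64, action_dim=2)
print(len(model_data))  # k * batch_size transitions
```

Because every rollout restarts from a real state, the model is queried for only k consecutive steps regardless of how long the task episode is. Episode termination handling is omitted here for brevity.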

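EXPERIMENTAL RESULTS

Research Questions

RQ1: Given model error and distribution shift, how can model-based updates guarantee monotonic improvement in policy performance?
RQ2: Under what conditions do short model rollouts provide practical benefit without worsening model exploitation or compounding error?
RQ3: Can branched, short-horizon model rollouts yield faster learning while retaining the asymptotic performance of the best model-free methods?
RQ4: How does model generalization to unseen policy distributions affect the usefulness of model-generated data?
RQ5: Which design choices (model ensemble, rollout length, optimization algorithm) improve sample efficiency in model-based policy optimization?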
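
For RQ1, it may help to see the shape of the return bound that the monotonic improvement argument rests on. The expression below follows the form of the paper's first bound on true versus model returns. The symbols η (true return), η̂ (return under the model), γ (discount factor), and r_max (maximum reward) do not appear in this summary and are taken from the paper's notation; the exact constants should be checked against the original.

```latex
\eta[\pi] \;\ge\; \hat{\eta}[\pi]
  - \left[ \frac{2\gamma\, r_{\max}\,(\epsilon_m + 2\epsilon_\pi)}{(1-\gamma)^2}
         + \frac{4\, r_{\max}\,\epsilon_\pi}{1-\gamma} \right]
```

If a policy update raises the model return η̂ by more than the bracketed penalty, the true return η is guaranteed to improve. The penalty grows with both the model error ε_m and the policy shift ε_π, which is why the analysis is pessimistic unless ε_m is estimated empirically.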

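Key Findings

MBPO learns faster than prior model-based methods while matching leading model-free algorithms in final performance.
On continuous control benchmarks, MBPO reaches model-free performance with roughly an order of magnitude fewer environment steps (for example, on Ant: 300k steps versus 3M steps for SAC).
Short (even single-step) model rollouts bring substantial gains; longer rollouts can hurt performance because errors accumulate.
The branched rollout strategy (starting from the real data distribution and simulating k steps) mitigates compounding error and scales to longer task horizons.
Ensembles of probabilistic dynamics models help capture uncertainty and reduce model exploitation.
Empirical measurements show that model generalization improves as data increases, which makes the bound on model usage more realistic and supports using the model in practice.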
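
The finding that longer rollouts can hurt is easy to see in a toy setting. The sketch below is an illustration only, not an experiment from the paper: it assumes scalar linear dynamics s' = a * s and a model whose coefficient is off by 5%, and all names and values (`a_true`, `a_model`, `model_error_after_k`) are hypothetical.

```python
def rollout(a, s0, k):
    """Apply the scalar linear dynamics s' = a * s for k steps."""
    return s0 * a ** k

def model_error_after_k(a_true, a_model, s0, k):
    """Absolute gap between the true state and the model's prediction after k steps."""
    return abs(rollout(a_true, s0, k) - rollout(a_model, s0, k))

# Assumed toy values: the model overestimates the dynamics coefficient by 5%.
a_true, a_model, s0 = 1.0, 1.05, 1.0
errors = [model_error_after_k(a_true, a_model, s0, k) for k in (1, 5, 20)]
print(errors)  # the gap widens as the rollout length k grows
```

A small one-step error compounds over the rollout, so keeping k short bounds how far model-generated states can drift from anything the real environment would produce.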


This review was generated by AI and reviewed by a human editor.