QUICK REVIEW

[Paper Review] Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Tatsuya Matsushima, Hiroki Furuta|arXiv (Cornell University)|Jun 5, 2020

Reinforcement Learning in Robotics | References: 62 | Cited by: 50

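One-Sentence Summary

Introduces BREMEN, a model-based offline RL method that combines implicit KL regularization with an ensemble of dynamics models to achieve high deployment efficiency (5–10 deployments) while maintaining competitive sample efficiency.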

ABSTRACT

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN) that can effectively optimize a policy offline using 10-20 times fewer data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Codes and pre-trained models are available at https://github.com/matsuolab/BREMEN .

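Motivation and Objectives

Motivate deployment efficiency as a practical evaluation metric for RL in settings where deployment is costly (health, robotics, dialogue, education, etc.).
Develop an algorithm that learns successful policies with very few changes to the data-collection policy.
Achieve strong performance on smaller offline datasets by using a model ensemble and conservative updates.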

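Proposed Method

Proposes Behavior-Regularized Model-ENsemble (BREMEN), which combines an ensemble of deterministic dynamics models with a policy updated via trust-region optimization.
Trains the policy on imaginary rollouts from the model ensemble, reducing reliance on real environment interaction.
Initializes the policy by behavior cloning on the most recent data, which implicitly regularizes against distribution shift.
Applies KL-based trust-region updates to constrain policy improvement and regularize learning, without adding an explicit KL penalty term to the objective.
Trains the dynamics models on the collected data. At each deployment, it collects a batch of data, updates the model ensemble, estimates a behavior policy from the data, re-initializes the policy, and performs T offline KL-constrained updates using imaginary rollouts.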
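
The deployment loop described above can be sketched on a toy problem. This is a minimal illustration, not the authors' implementation: the 1-D environment, the linear least-squares models, the fixed-variance linear Gaussian policy, the known reward function, and the finite-difference step with backtracking (standing in for a TRPO-style update) are all assumptions made to keep the example short, and every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.3  # fixed policy std (assumption: keeps the Gaussian KL in closed form)

def true_step(s, a):
    # Toy stand-in for the real environment (assumption): 1-D point mass.
    return s + 0.1 * a

def deploy(theta, n_episodes=10, horizon=20):
    # One deployment: collect a batch with the current policy a ~ N(w*s + b, SIGMA^2).
    data = []
    for _ in range(n_episodes):
        s = rng.normal()
        for _ in range(horizon):
            a = theta[0] * s + theta[1] + SIGMA * rng.normal()
            s_next = true_step(s, a)
            data.append((s, a, s_next))
            s = s_next
    return np.array(data)

def fit_ensemble(data, k=5):
    # K deterministic linear models s' = [s, a, 1] @ w, each fit on a bootstrap resample.
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(data), len(data))
        X = np.column_stack([data[idx, 0], data[idx, 1], np.ones(len(idx))])
        w, *_ = np.linalg.lstsq(X, data[idx, 2], rcond=None)
        models.append(w)
    return models

def behavior_clone(data):
    # Least-squares estimate of the behavior policy mean, a ~ w*s + b.
    X = np.column_stack([data[:, 0], np.ones(len(data))])
    theta, *_ = np.linalg.lstsq(X, data[:, 1], rcond=None)
    return theta

def imagined_return(theta, models, starts, picks):
    # Imaginary rollout of the policy mean; step t uses ensemble member picks[t].
    # The reward -s'^2 is assumed known to the learner.
    s, total = starts.copy(), 0.0
    for k in picks:
        a = theta[0] * s + theta[1]
        w = models[k]
        s = w[0] * s + w[1] * a + w[2]
        total += np.mean(-s ** 2)
    return total

def mean_kl(theta_old, theta_new, states):
    # KL between two Gaussians with equal std, averaged over states.
    diff = (theta_new[0] - theta_old[0]) * states + (theta_new[1] - theta_old[1])
    return np.mean(diff ** 2) / (2 * SIGMA ** 2)

def kl_constrained_step(theta, models, states, delta=0.05, eps=1e-3, horizon=10):
    # Finite-difference ascent direction, then backtrack until the candidate
    # stays within the KL budget and does not reduce the imagined return.
    picks = rng.integers(len(models), size=horizon)  # shared across evaluations
    base = imagined_return(theta, models, states, picks)
    grad = np.zeros(2)
    for i in range(2):
        e = np.zeros(2)
        e[i] = eps
        grad[i] = (imagined_return(theta + e, models, states, picks)
                   - imagined_return(theta - e, models, states, picks)) / (2 * eps)
    step = 1.0
    for _ in range(30):
        cand = theta + step * grad
        if (mean_kl(theta, cand, states) <= delta
                and imagined_return(cand, models, states, picks) >= base):
            return cand
        step *= 0.5
    return theta  # no acceptable step found; keep the old policy

def bremen_sketch(n_deployments=5, updates_per_deployment=20):
    theta, data = np.zeros(2), np.empty((0, 3))
    for _ in range(n_deployments):
        batch = deploy(theta)               # the only real-environment interaction
        data = np.vstack([data, batch])
        models = fit_ensemble(data)         # update the model ensemble
        theta = behavior_clone(batch)       # re-initialize from the latest batch
        for _ in range(updates_per_deployment):
            theta = kl_constrained_step(theta, models, batch[:, 0])
    return theta

if __name__ == "__main__":
    print(bremen_sketch())
```

In this toy setup the environment is queried only inside `deploy`, so the number of distinct data-collection policies equals `n_deployments`, while all policy improvement happens against the learned models.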

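Experimental Results

Research Questions

RQ1: Can deployment efficiency serve as a practical metric for reducing the cost and risk of data collection in RL?
RQ2: Under deployment constraints, does a model-based offline method that combines a dynamics ensemble with implicit KL regularization outperform conventional online/offline RL methods?
RQ3: How does BREMEN perform on standard offline RL benchmarks across dataset sizes (1M, 100K, 50K) and in deployment-constrained settings?
RQ4: How do behavior-cloning initialization and implicit KL regularization affect the mitigation of model bias and distribution shift?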

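Key Findings

BREMEN achieves high deployment efficiency on MuJoCo continuous-control tasks, learning successful policies with only 5–10 deployments.
In the offline batch setting, BREMEN is competitive on datasets of 1M transitions and outperforms baselines when trained on datasets 10–20 times smaller.
In the deployment-constrained setting, BREMEN makes markedly more progress than SAC, ME-TRPO, BCQ, and BRAC within a limited number of deployments.
Behavior-cloning initialization combined with conservative trust-region updates provides implicit KL regularization, which outperforms an explicit KL penalty in this setting.
BREMEN's offline performance on standard benchmarks is close to that of state-of-the-art model-free offline methods, while requiring far fewer deployments.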
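
To make the contrast between explicit and implicit KL regularization concrete, the two formulations can be written roughly as follows. The notation here is assumed for illustration and may differ from the paper's: \pi_\theta is the learned policy, \hat{\pi}_\beta the estimated behavior policy, \mathcal{D} the offline dataset, \alpha a penalty weight, \delta a trust-region radius, and T the number of offline updates per deployment. The direction of each KL term follows common convention and is also an assumption.

```latex
% Explicit KL penalty (behavior-regularized, model-free style):
\max_{\theta}\;
  \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta}\!\left[ Q(s, a) \right]
  - \alpha\, \mathbb{E}_{s \sim \mathcal{D}}\!\left[
      D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid s) \,\|\, \hat{\pi}_\beta(\cdot \mid s) \right)
    \right]

% Implicit regularization: start from the behavior-cloned policy ...
\theta_0 = \arg\max_{\theta}\;
  \mathbb{E}_{(s, a) \sim \mathcal{D}}\!\left[ \log \pi_\theta(a \mid s) \right]

% ... then take T trust-region steps on imagined data
\theta_{k+1} = \arg\max_{\theta}\;
  \mathbb{E}_{s, a \sim \pi_{\theta_k}}\!\left[
    \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{s}\!\left[
    D_{\mathrm{KL}}\!\left( \pi_{\theta_k}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right)
  \right] \le \delta,
\qquad k = 0, \dots, T - 1
```

Informally, because each step may move at most \delta away from the previous iterate, a small number of steps T keeps the policy in a neighborhood of the behavior-cloned starting point, with no penalty weight \alpha to tune.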


This review was generated by AI and reviewed by human editors.