QUICK REVIEW

[Paper Review] Robust Reinforcement Learning for Continuous Control with Model Misspecification

Daniel J. Mankowitz, Nir Levine|arXiv (Cornell University)|Jun 18, 2019

Reinforcement Learning in Robotics | References: 40 | Cited by: 38

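One-Sentence Summary

This paper proposes Robust MPO (R-MPO) and Soft-Robust MPO (SR-MPO), which optimize for worst-case return under perturbations of the transition dynamics, extends MPO with robust and soft-robust entropy-regularized Bellman operators, and demonstrates improved performance on nine MuJoCo domains and a high-dimensional Shadow Hand.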

ABSTRACT

We provide a framework for incorporating robustness -- to perturbations in the transition dynamics which we refer to as model misspecification -- into continuous control Reinforcement Learning (RL) algorithms. We specifically focus on incorporating robustness into a state-of-the-art continuous control RL algorithm called Maximum a-posteriori Policy Optimization (MPO). We achieve this by learning a policy that optimizes for a worst case expected return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, we introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. We show that both, robust and soft-robust policies, outperform their non-robust counterparts in nine Mujoco domains with environment perturbations. In addition, we show improved robust performance on a high-dimensional, simulated, dexterous robotic hand. Finally, we present multiple investigative experiments that provide a deeper insight into the robustness framework. This includes an adaptation to another continuous control RL algorithm as well as learning the uncertainty set from offline data. Performance videos can be found online at https://sites.google.com/view/robust-rl.

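Motivation and Goals

Motivate robustness to perturbations in the transition dynamics (model misspecification) in continuous control RL.
Incorporate robustness into MPO and extend it to entropy-regularized objectives.
Develop robust and soft-robust entropy-regularized Bellman operators that are contractions.
Empirically validate robustness on multiple MuJoCo domains and a high-dimensional Shadow Hand.
Explore additional analyses, such as learning the uncertainty set from offline data and adapting the approach to another algorithm.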

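Proposed Method

Derive a robust Bellman operator by replacing the expectation over next states in the standard TD target with a worst-case infimum over an uncertainty set of transition models.
Incorporate this operator into MPO's policy evaluation step to learn a robust value function, and derive a robust policy via a robust proposal distribution.
Extend the operator to robust and soft-robust entropy-regularized versions and prove that they are contractions.
Instantiate Robust Entropy-regularized MPO (RE-MPO) and Soft-Robust Entropy-regularized MPO (SRE-MPO), and compare them against E-MPO and MPO.
Demonstrate robustness through experiments on nine MuJoCo domains and the Shadow Hand, and carry out investigative analyses (uncertainty set design, domain randomization, offline data, and so on).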
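The difference between the robust and soft-robust backups described above can be illustrated with a minimal sketch. It assumes a finite uncertainty set of transition models and that a value estimate of the next state is already available under each model. The function names (`robust_backup`, `soft_robust_backup`), the array layout, and the optional `weights` argument are hypothetical and chosen for illustration only; this is not the authors' implementation, and it omits the entropy regularization and the MPO policy improvement step.

```python
import numpy as np


def robust_backup(reward, next_values, gamma=0.99):
    """Worst-case one-step target.

    reward:      shape (batch,)
    next_values: shape (batch, num_models); column k holds V(s') where s'
                 was sampled from the k-th model in the uncertainty set
                 (assumed layout, for illustration).
    """
    # Take the minimum over the models in the uncertainty set.
    return reward + gamma * np.min(next_values, axis=-1)


def soft_robust_backup(reward, next_values, gamma=0.99, weights=None):
    """Less conservative target: average over the uncertainty set.

    weights: optional distribution over the models, shape (num_models,);
             uniform if None (an assumption of this sketch).
    """
    return reward + gamma * np.average(next_values, axis=-1, weights=weights)


if __name__ == "__main__":
    reward = np.array([1.0, 0.0])
    # Two transitions, three perturbed models in the uncertainty set.
    next_values = np.array([[2.0, 1.0, 4.5],
                            [0.5, 0.5, 0.5]])
    print("robust:     ", robust_backup(reward, next_values, gamma=0.9))
    print("soft-robust:", soft_robust_backup(reward, next_values, gamma=0.9))
```

Because a minimum never exceeds a weighted average of the same values, the robust target is always at most the soft-robust target, which is one way to read the "less conservative" description of the soft-robust objective.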

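Experimental Results

Research Questions

RQ1: Under model misspecification, does incorporating worst-case robustness to transition perturbations improve performance on continuous control tasks?
RQ2: How do the robust and soft-robust entropy-regularized objectives compare with standard MPO across different domains?
RQ3: Can the robustness technique be transferred to other RL algorithms, and can the uncertainty set be learned from offline data?
RQ4: How do uncertainty set design and domain randomization affect robust performance?
RQ5: How does robustness scale to high-dimensional, dexterous control tasks such as the Shadow Hand?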

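Key Findings

Robust MPO (R-MPO) and Soft-Robust MPO (SR-MPO) outperform their non-robust counterparts on nine MuJoCo domains with environment perturbations.
The entropy-regularized versions (RE-MPO and SRE-MPO) perform at least as well as their non-robust equivalents, and sometimes better.
On the high-dimensional Shadow Hand task, the robust method also shows better performance than non-robust MPO.
The soft-robust variants generally outperform the non-robust baselines, although on some tasks their advantage can diminish as the perturbations grow.
Learning the uncertainty set from offline data (DDR-MPO) gives performance that becomes competitive as the amount of data increases, reaching that of R-MPO with large datasets.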

This review was generated by AI and reviewed by a human editor.