QUICK REVIEW

[Paper Review] EPOpt: Learning Robust Neural Network Policies Using Model Ensembles

Aravind Rajeswaran, Sarvjeet Ghotra|arXiv (Cornell University)|Oct 5, 2016

Reinforcement Learning in Robotics | References: 30 | Citations: 144

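One-Sentence Summary

EPOpt trains robust neural network policies through adversarial training over an ensemble of simulated models, and adapts the source distribution using target-domain data.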

ABSTRACT

Sample complexity and safety are major challenges when learning policies with reinforcement learning for real-world tasks, especially when the policies are represented using rich function approximators like deep neural networks. Model-based methods where the real-world target domain is approximated using a simulated source domain provide an avenue to tackle the above challenges by augmenting real data with simulated data. However, discrepancies between the simulated source domain and the target domain pose a challenge for simulated training. We introduce the EPOpt algorithm, which uses an ensemble of simulated source domains and a form of adversarial training to learn policies that are robust and generalize to a broad range of possible target domains, including unmodeled effects. Further, the probability distribution over source domains in the ensemble can be adapted using data from target domain and approximate Bayesian methods, to progressively make it a better approximation. Thus, learning on a model ensemble, along with source domain adaptation, provides the benefit of both robustness and learning/adaptation.

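Motivation and Objectives

Motivate robust reinforcement learning for physical control under model mismatch and safety constraints.
Propose a method that generalizes policies across a distribution of source models by training on an ensemble.
Introduce adaptation of the source model distribution using target-domain data, so that it better approximates the target dynamics.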

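Proposed Method

Generate trajectories for policy updates from an ensemble of source-domain models sampled from a parameter distribution.
Optimize a CVaR (ε-percentile) objective so that learning focuses on the worst-performing models in the ensemble.
Use a TRPO-based batch policy optimization subroutine that updates the policy with the worst ε fraction of trajectories.
Adapt the source-domain distribution through approximate Bayesian updates on target-domain trajectories, refining the model parameters.
Optionally, during adaptation, apply importance sampling to reweight the model samples when the discrepancy with the target domain is significant.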
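
The two mechanical steps in the list above, selecting the worst ε fraction of trajectories and reweighting sampled models with target-domain data, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `worst_epsilon_indices` and `reweight_models`, the rounding rule for the subset size, and the toy numbers are all assumptions made for illustration. The TRPO update itself is omitted, and the reweighting is a plain prior-times-likelihood update over a fixed set of sampled models, not the full adaptation procedure described in the paper.

```python
import math


def worst_epsilon_indices(returns, epsilon):
    """Indices of the worst epsilon fraction of trajectories, lowest return first."""
    n = len(returns)
    k = max(1, round(epsilon * n))  # assumption: always keep at least one trajectory
    ranked = sorted(range(n), key=lambda i: returns[i])
    return ranked[:k]


def reweight_models(prior_weights, log_likelihoods):
    """Posterior weight per sampled model: prior weight times data likelihood, normalized.

    Assumes every prior weight is strictly positive.
    """
    log_post = [math.log(w) + ll for w, ll in zip(prior_weights, log_likelihoods)]
    shift = max(log_post)  # subtract the max so exp() cannot overflow
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy usage with made-up numbers (not results from the paper).
returns = [10, 2, 7, 1, 9, 5, 8, 3, 6, 4]  # one return per sampled model
batch = worst_epsilon_indices(returns, epsilon=0.2)
# `batch` holds the trajectories a policy optimization step would train on.

weights = reweight_models([0.5, 0.5], [0.0, math.log(3.0)])
# The second model explains the target data better, so it gains weight.
```

Working in log space keeps the reweighting stable when trajectory likelihoods are very small, which is the usual case for long rollouts.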

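Experimental Results

Research Questions

RQ1: Does training on a distribution (ensemble) of models improve policy robustness to model mismatch compared with training on a single model?
RQ2: How do the ε-CVaR-based EPOpt variants affect direct-transfer performance in the target domain?
RQ3: Can EPOpt learn policies that are robust to unmodeled effects not covered by the source-domain ensemble?
RQ4: With limited target-domain data, how efficiently can the source distribution be adapted to the target domain?
RQ5: For Bayesian reinforcement learning in a transfer setting, what advantage does model adaptation offer over standard maximum-likelihood model selection?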

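Key Findings

Policies trained with EPOpt-ε generalize better across a wide range of model instances than single-model TRPO on the Hopper and Half-Cheetah benchmarks.
EPOpt(ε = 0.1) yields highly robust policies across diverse model parameters, with strong direct-transfer performance.
EPOpt is robust to unmodeled effects when the source domain covers diverse parameters, although robustness improves further when mass variation is also included in the source distribution.
Model adaptation can align the source distribution with the target domain using relatively little target-domain data, improving target performance over time.
Adopting a more conservative, robust policy does not significantly degrade EPOpt's direct-transfer performance.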

This review was generated by AI and reviewed by human editors.