QUICK REVIEW

[Paper Review] Model-Ensemble Trust-Region Policy Optimization

Thanard Kurutach, Ignasi Clavera|arXiv (Cornell University)|Feb 28, 2018

Reinforcement Learning in Robotics | References: 31 | Cited by: 218

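ONE-SENTENCE SUMMARY

ME-TRPO combines an ensemble of dynamics models with trust-region policy optimization to reach state-of-the-art sample efficiency in model-based deep reinforcement learning, matching the performance of model-free methods with roughly 100 times less data.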

ABSTRACT

Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and to date have succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training. To overcome this issue, we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process. We further show that the use of likelihood ratio derivatives yields much more stable learning than backpropagation through time. Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark tasks.

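RESEARCH MOTIVATION AND OBJECTIVES

Reduce the sample complexity of reinforcement learning by exploiting learned dynamics models.
Investigate the instability of vanilla model-based deep RL when neural networks represent both the model and the policy.
Develop a robust training framework that maintains model uncertainty and stabilizes policy updates.
Show that model ensembles and TRPO improve stability and performance on challenging tasks.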

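PROPOSED METHOD

Introduce an ensemble of neural network dynamics models to capture model uncertainty.
Train every model on the collected real-world data, and sample imagined roll-outs from the ensemble.
Replace backpropagation through time (BPTT) with a likelihood-ratio gradient estimator for policy optimization.
Use Trust Region Policy Optimization (TRPO) to constrain policy updates on the imagined trajectories.
Validate policy updates by monitoring the policy's performance under every model in the ensemble, and stop when improvement falls below a threshold.
Iteratively collect real-environment data to refine the ensemble and retrain the policy.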
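
The steps listed above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy linear environment, the linear least-squares models (standing in for neural networks), the bootstrap resampling used to diversify the ensemble, the hand-picked candidate policy (standing in for a TRPO step), and the 0.7 acceptance threshold are all assumptions made for illustration. The names `true_step`, `fit_ensemble`, `imagined_return`, and `fraction_improved` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, a):
    # Hypothetical toy environment (not from the paper): a noisy linear system.
    return 0.9 * s + 0.5 * a + 0.01 * rng.standard_normal(s.shape)

def fit_ensemble(S, A, S_next, k=5):
    # Fit k linear models s' ~ [s, a] @ W. Each member sees its own bootstrap
    # resample (an assumption here) so that members differ slightly.
    X = np.hstack([S, A])
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))
        W, *_ = np.linalg.lstsq(X[idx], S_next[idx], rcond=None)
        models.append(W)
    return models

def imagined_return(models, gain, s0, horizon=20, fixed=None):
    # Roll out the linear policy a = -gain * s inside the learned models.
    # With fixed=None a randomly chosen ensemble member predicts each step;
    # otherwise only member `fixed` is used (for validation).
    s, total = s0.copy(), 0.0
    for _ in range(horizon):
        a = -gain * s
        W = models[rng.integers(len(models))] if fixed is None else models[fixed]
        s = np.concatenate([s, a]) @ W
        total -= float(s @ s)  # reward: stay close to the origin
    return total

def fraction_improved(models, old_gain, new_gain, s0):
    # Share of ensemble members under which the new policy beats the old one.
    wins = [
        imagined_return(models, new_gain, s0, fixed=i)
        > imagined_return(models, old_gain, s0, fixed=i)
        for i in range(len(models))
    ]
    return float(np.mean(wins))

# Collect real data with random actions, then fit the ensemble.
S = rng.uniform(-1.0, 1.0, size=(200, 1))
A = rng.uniform(-1.0, 1.0, size=(200, 1))
S_next = true_step(S, A)
models = fit_ensemble(S, A, S_next)

# Compare a candidate policy against the current one on imagined data only.
s0 = np.array([1.0])
frac = fraction_improved(models, old_gain=0.0, new_gain=1.0, s0=s0)
accept = frac >= 0.7  # illustrative threshold, an assumption
print("fraction of models improved:", frac, "accept update:", accept)
```

In a full loop, an accepted policy would be run in the real environment to gather more transitions, the ensemble would be refit, and the cycle would repeat.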

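EXPERIMENTAL RESULTS

Research Questions

RQ1: How does model-based RL with neural network dynamics compare with state-of-the-art model-free methods in sample efficiency and final performance?
RQ2: Does an ensemble of dynamics models regularize policy learning and mitigate model bias?
RQ3: Does replacing BPTT with likelihood-ratio gradient estimation stabilize training on long-horizon tasks?
RQ4: Within the model-based, ensemble-regularized framework, how does TRPO compare with other policy gradient methods?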

Main Findings

The method matches the performance of model-free methods while using roughly 100 times less real-world data.
Vanilla model-based deep RL suffers from instability and model bias, particularly over long horizons.
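Using an ensemble of dynamics models acts as a regularizer and reduces the policy's overfitting to any single model.
Replacing BPTT with TRPO yields more stable and more effective policy learning.
Increasing the number of models in the ensemble improves performance, especially on complex tasks such as Half-Cheetah and Ant.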
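
For readers less familiar with the likelihood-ratio approach contrasted with BPTT in these findings, the standard textbook forms are given here as a reference. The notation is an assumption, not taken from this summary: $\pi_\theta$ is the policy, $\hat{A}$ an advantage estimate computed on imagined trajectories $\tau$, and $\delta$ the trust-region size.

```latex
% Likelihood-ratio (score-function) policy gradient: no derivative of the
% learned dynamics model is needed, unlike backpropagation through time.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]

% TRPO: maximize the surrogate objective subject to a KL trust region.
\max_\theta \;
  \mathbb{E}\left[
    \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, \hat{A}(s, a)
  \right]
\quad \text{s.t.} \quad
  \mathbb{E}\left[
    D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right)
  \right] \le \delta
```

Because the first expression only differentiates the policy's log-probabilities, errors in the learned model's gradients do not compound across time steps, which is consistent with the stability benefit reported above.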

This review was generated by AI and reviewed by human editors.