QUICK REVIEW

[Paper Review] Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion

Jacob Buckman, Danijar Hafner | arXiv (Cornell University) | Jul 4, 2018

Reinforcement Learning in Robotics | References: 26 | Cited by: 97

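One-Sentence Summary

STEVE combines model-based rollouts with model-free TD learning, using an ensemble to estimate uncertainty and adapt the rollout horizon, so that it achieves high sample efficiency without the performance degradation that model bias would otherwise cause.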

ABSTRACT

Integrating model-free and model-based approaches in reinforcement learning has the potential to achieve the high performance of model-free algorithms with low sample complexity. However, this is difficult because an imperfect dynamics model can degrade the performance of the learning algorithm, and in sufficiently complex environments, the dynamics model will almost always be imperfect. As a result, a key challenge is to combine model-based approaches with model-free learning in such a way that errors in the model do not degrade performance. We propose stochastic ensemble value expansion (STEVE), a novel model-based technique that addresses this issue. By dynamically interpolating between model rollouts of various horizon lengths for each individual example, STEVE ensures that the model is only utilized when doing so does not introduce significant errors. Our approach outperforms model-free baselines on challenging continuous control benchmarks with an order-of-magnitude increase in sample efficiency, and in contrast to previous model-based approaches, performance does not degrade in complex environments.

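Motivation and Objectives

Improve sample efficiency and reduce sample complexity in reinforcement learning by integrating model-based and model-free methods.
Address model bias under imperfect dynamics models by making adaptive use of rollouts.
Develop an uncertainty-based, per-sample method for adaptively choosing the rollout horizon so as to minimize target error.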

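Proposed Method

Use ensembles of Q-functions, reward models, and dynamics models to estimate uncertainty.
Roll out the learned model for multiple steps (over several horizons) and compute multiple candidate TD targets.
Compute the STEVE target as an inverse-variance-weighted mixture of the candidate targets across horizons (0 to H).
When training the Q-function, replace the standard TD target with the STEVE target.
Provide theoretical justification via a bias-variance decomposition, approximately minimizing the variance of the target.
Demonstrate and compare performance on continuous control benchmarks with a DDPG backbone.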
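
The target construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `candidate_targets` and `steve_target`, the array layouts, and the `eps` stabilizer are assumptions made for illustration, and episode-termination handling is omitted for brevity. Each row of the candidate array holds the ensemble's targets for one horizon. The per-horizon mean and variance are then combined with inverse-variance weights.

```python
import numpy as np

def candidate_targets(r0, rollout_rewards, bootstrap_q, gamma=0.99):
    """Build an (H+1, K) array of candidate TD targets (hypothetical helper).

    r0:              observed reward of the real transition (scalar).
    rollout_rewards: shape (H, K); imagined reward at model step j = 1..H
                     for each of K ensemble members.
    bootstrap_q:     shape (H+1, K); Q estimate at the state reached after
                     i = 0..H imagined steps.
    Termination flags are ignored here (a simplification).
    """
    rollout_rewards = np.asarray(rollout_rewards, dtype=float)
    bootstrap_q = np.asarray(bootstrap_q, dtype=float)
    H, K = rollout_rewards.shape
    # Discounted imagined rewards: gamma^j * r_j for j = 1..H.
    disc = (gamma ** np.arange(1, H + 1))[:, None] * rollout_rewards
    # Cumulative imagined return up to horizon i (row 0 is the empty sum).
    cum = np.vstack([np.zeros((1, K)), np.cumsum(disc, axis=0)])
    # Bootstrap term: gamma^(i+1) * Q_i for i = 0..H.
    boot = (gamma ** np.arange(1, H + 2))[:, None] * bootstrap_q
    return r0 + cum + boot

def steve_target(candidates, eps=1e-8):
    """Inverse-variance-weighted mixture over horizons.

    candidates: shape (H+1, K); row i holds the K ensemble targets for horizon i.
    eps:        assumed stabilizer that avoids division by zero.
    Returns the scalar target and the per-horizon weights.
    """
    candidates = np.asarray(candidates, dtype=float)
    mean = candidates.mean(axis=1)        # per-horizon ensemble mean
    var = candidates.var(axis=1)          # per-horizon ensemble variance
    inv_var = 1.0 / (var + eps)
    weights = inv_var / inv_var.sum()     # normalized to sum to 1
    return float(weights @ mean), weights
```

In this sketch, a horizon on which the ensemble members disagree strongly receives a small weight, so the mixture falls back toward shorter horizons (ultimately the ordinary TD target at horizon 0) when the model appears unreliable for a given sample.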

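Experimental Results

Research Questions

RQ1: Can stochastic ensembles and uncertainty-guided horizon selection improve the stability and efficiency of model-based value expansion?
RQ2: On challenging continuous control tasks with inaccurate models, does STEVE outperform purely model-free methods and standard MVE?
RQ3: How does dynamic horizon weighting affect sample efficiency and robustness to model error?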

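Key Findings

STEVE significantly improves sample efficiency over model-free baselines on challenging continuous control tasks.
Unlike plain MVE, which can diverge under a noisy model, STEVE remains robust to model imperfections.
The inverse-variance-weighted average of candidate targets uses uncertainty estimates to reduce target error.
Ablations show that the gains come from uncertainty-based reweighting, not merely from using a larger model ensemble.
Wall-clock experiments show that, when parallelized, STEVE is competitive with model-free methods thanks to its better sample efficiency.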
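
As a brief worked note on why inverse-variance weighting reduces target error: the derivation below is the standard result for combining estimators, under the simplifying assumption that the per-horizon candidate targets $T_0, \dots, T_H$ are independent and unbiased with variances $\sigma_0^2, \dots, \sigma_H^2$. The symbols $w_i$, $\sigma_i^2$, and $\lambda$ are notation introduced here for illustration, and the independence assumption is an idealization.

```latex
\begin{aligned}
&\min_{w}\ \operatorname{Var}\!\Big[\sum_{i=0}^{H} w_i T_i\Big]
  = \sum_{i=0}^{H} w_i^2 \sigma_i^2
  \quad \text{subject to} \quad \sum_{i=0}^{H} w_i = 1, \\[4pt]
&\frac{\partial}{\partial w_i}\Big(\sum_{j} w_j^2 \sigma_j^2
  - \lambda \big(\textstyle\sum_{j} w_j - 1\big)\Big)
  = 2 w_i \sigma_i^2 - \lambda = 0
  \;\Longrightarrow\; w_i = \frac{\lambda}{2\sigma_i^2}, \\[4pt]
&\sum_i w_i = 1
  \;\Longrightarrow\;
  w_i = \frac{\sigma_i^{-2}}{\sum_{j=0}^{H} \sigma_j^{-2}},
  \qquad
  \operatorname{Var}\!\Big[\sum_i w_i T_i\Big]
  = \frac{1}{\sum_{j=0}^{H} \sigma_j^{-2}} .
\end{aligned}
```

Under these assumptions the combined variance is no larger than the smallest individual $\sigma_i^2$, which is consistent with the finding that the weighted mixture does not perform worse than relying on any single horizon.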


This review was generated by AI and reviewed by human editors.