QUICK REVIEW

[Paper Review] CEM-RL: Combining evolutionary and gradient-based methods for policy search

Aloïs Pourchot, Olivier Sigaud|arXiv (Cornell University)|Oct 2, 2018

Reinforcement Learning in Robotics | References: 30 | Cited by: 95

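One-sentence summary

CEM-RL combines the Cross-Entropy Method with TD3 to exploit both evolutionary exploration and gradient-based policy improvement, achieving competitive or better performance and stability on continuous-control benchmarks.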

ABSTRACT

Deep neuroevolution and deep reinforcement learning (deep RL) algorithms are two popular approaches to policy search. The former is widely applicable and rather stable, but suffers from low sample efficiency. By contrast, the latter is more sample efficient, but the most sample efficient variants are also rather unstable and highly sensitive to hyper-parameter setting. So far, these families of methods have mostly been compared as competing tools. However, an emerging approach consists in combining them so as to get the best of both worlds. Two previously existing combinations use either an ad hoc evolutionary algorithm or a goal exploration process together with the Deep Deterministic Policy Gradient (DDPG) algorithm, a sample efficient off-policy deep RL algorithm. In this paper, we propose a different combination scheme using the simple cross-entropy method (CEM) and Twin Delayed Deep Deterministic policy gradient (td3), another off-policy deep RL algorithm which improves over ddpg. We evaluate the resulting method, cem-rl, on a set of benchmarks classically used in deep RL. We show that cem-rl benefits from several advantages over its competitors and offers a satisfactory trade-off between performance and sample efficiency.

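Motivation and objectives

Combine evolutionary methods with deep reinforcement learning for policy search, to balance exploration, stability, and sample efficiency.
Propose a concrete method (cem-rl) that couples the cross-entropy method with critic-driven gradient updates based on TD3.
Evaluate cem-rl on standard Mujoco benchmarks against baselines (cem, td3, multi-actor td3) and an existing hybrid method (erl).
Analyze how much the evolutionary component and the gradient-based improvement each contribute to performance and stability.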

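Proposed method

Sample a population of actors from a Gaussian distribution with covariance Sigma centered on the current mean policy parameters.
Evaluate half of the population directly; improve the other half with gradient steps guided by the TD3 critic, then evaluate it.
Update the population mean and covariance from the best-performing half of the actors (the cem update).
Store all new experience in a shared replay buffer and use it to train the critic; apply the gradient steps to actors drawn from the population.
The method may also use importance mixing to reuse samples across generations, and the paper explicitly discusses how the budget is split between environment steps and learning updates.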
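The loop described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the authors' implementation: the "policy" is a plain parameter vector, the episode return is replaced by a toy quadratic `fitness`, and the critic-guided update is replaced by one analytic gradient-ascent step in `gradient_step`. The function names, the diagonal covariance, the uniform weighting of elites, and the small `noise` term added to the variance are all choices made for this sketch. There is no replay buffer or critic training here.

```python
import random


def fitness(theta, target):
    """Toy stand-in for an episode return (assumption): higher is better."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def gradient_step(theta, target, lr=0.1):
    """Stand-in for critic-guided actor updates (assumption).

    Takes one gradient-ascent step on the toy fitness; the derivative of
    -(theta - target)^2 with respect to theta is -2 * (theta - target).
    """
    return [t + lr * (-2.0 * (t - g)) for t, g in zip(theta, target)]


def cem_rl_sketch(target, pop_size=10, iterations=50, sigma_init=1.0,
                  noise=1e-3, seed=0):
    """Hypothetical CEM loop in which half the population gets a gradient step."""
    rng = random.Random(seed)
    dim = len(target)
    mean = [0.0] * dim
    var = [sigma_init ** 2] * dim  # diagonal covariance (assumption)
    n_elite = pop_size // 2

    for _ in range(iterations):
        # 1. Sample a population around the current mean.
        population = [
            [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
            for _ in range(pop_size)
        ]
        # 2. Improve half of the individuals with a gradient step.
        for i in range(pop_size // 2):
            population[i] = gradient_step(population[i], target)
        # 3. Evaluate everyone and keep the best-performing half.
        ranked = sorted(population, key=lambda th: fitness(th, target),
                        reverse=True)
        elites = ranked[:n_elite]
        # 4. CEM update: refit mean and diagonal variance to the elites,
        #    with uniform weights and a small noise floor on the variance.
        mean = [sum(e[d] for e in elites) / n_elite for d in range(dim)]
        var = [
            sum((e[d] - mean[d]) ** 2 for e in elites) / n_elite + noise
            for d in range(dim)
        ]
    return mean
```

Calling `cem_rl_sketch([1.0, -2.0, 0.5])` returns a mean vector close to the target. In this toy setting the gradient-improved individuals tend to dominate the elite set while the mean is still far from the optimum, which is one way to picture how gradient information can steer the evolutionary search.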

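Experimental results

Research questions

RQ1: Does cem-rl outperform its components (cem and td3) and a multi-actor variant of td3 on standard continuous-control benchmarks?
RQ2: How does cem-rl compare with erl in final performance, convergence speed, and learning stability?
RQ3: In practice, does the combination provide better sample efficiency and/or robustness to hyper-parameters?
RQ4: How much does the evolutionary component contribute beyond simply providing population-based exploration?
RQ5: Which limitations or environment characteristics cause cem-rl to perform poorly?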

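Key findings

cem-td3 generally outperforms cem, td3, and multi-actor td3 on several Mujoco benchmarks, with lower variance across learning runs.
The cem-rl variants (cem-ddpg and cem-td3) outperform erl on most of the tested environments, and cem-td3 often gives the best final performance and faster convergence.
The ablation shows that keeping several actors trained against a shared TD3 critic but dropping the evolutionary selection (multi-actor td3) degrades performance, which points to a benefit from combining evolution with gradient steps.
Compared with erl, cem-td3 tends to offer better stability and final performance, especially on harder environments such as walker2d-v2 and ant-v2.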


This review was generated by AI and reviewed by a human editor.