QUICK REVIEW

[Paper Review] GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms

Cédric Colas, Olivier Sigaud|arXiv (Cornell University)|Feb 14, 2018

Reinforcement Learning in Robotics | References: 41 | Cited by: 75

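One-sentence summary

GEP-PG combines Goal Exploration Processes (GEP) with Deep Deterministic Policy Gradient (DDPG) to decouple exploration from exploitation, improving sample efficiency, final performance, and stability over DDPG alone on the Continuous Mountain Car (CMC) and Half-Cheetah (HC) benchmarks.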

ABSTRACT

In continuous action domains, standard deep reinforcement learning algorithms like DDPG suffer from inefficient exploration when facing sparse or deceptive reward problems. Conversely, evolutionary and developmental methods focusing on exploration like Novelty Search, Quality-Diversity or Goal Exploration Processes explore more robustly but are less efficient at fine-tuning policies using gradient descent. In this paper, we present the GEP-PG approach, taking the best of both worlds by sequentially combining a Goal Exploration Process and two variants of DDPG. We study the learning performance of these components and their combination on a low dimensional deceptive reward problem and on the larger Half-Cheetah benchmark. We show that DDPG fails on the former and that GEP-PG improves over the best DDPG variant in both environments. Supplementary videos and discussion can be found at http://frama.link/gep_pg, the code at http://github.com/flowersteam/geppg.

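Motivation and objectives

Address the exploration challenge in continuous-action reinforcement learning, particularly under sparse or deceptive rewards.
Propose a two-phase framework: explore first with Goal Exploration Processes (GEP), then exploit with DDPG variants trained from a replay buffer.
Evaluate the approach empirically on a low-dimensional benchmark (Continuous Mountain Car) and a higher-dimensional one (Half-Cheetah).
Assess the effect on final performance, sample efficiency, and variability across learning runs.
Discuss the robustness, limitations, and potential extensions of the GEP-PG framework.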

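Proposed method

Define two learning phases: an exploration phase that uses Goal Exploration Processes to build a diverse repertoire of policies, followed by an exploitation phase that uses DDPG.
Store each resulting (theta, o) pair of policy parameters and observed outcome in memory; then sample goals in the outcome space, retrieve the policy whose stored outcome is closest, and perturb its parameters with Gaussian noise to generate new policies.
Fill the DDPG replay buffer with the samples generated by GEP, then train a DDPG variant that uses either action perturbation or parameter perturbation.
Compare against the standard DDPG variants, analyzing performance, variance, and sample efficiency on CMC and HC.
Use a standardized evaluation protocol with multiple seeds and bootstrap or other statistical tests to assess significance.
Report both the absolute performance of the best policy found during training and the performance over the last 100 evaluation episodes.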
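The exploration phase and the buffer hand-off described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 1-D environment, the three-parameter linear policy, the random warm-up phase, the goal range, and every name (`rollout`, `gep`, `sigma`, `n_bootstrap`, `n_goal`) are assumptions made for the example. DDPG training itself is not shown; the sketch stops once the replay buffer is filled.

```python
import random

def rollout(theta, horizon=50):
    """Run a linear policy in a toy 1-D point-mass environment (hypothetical).

    Returns the list of transitions and the outcome o (final position)."""
    pos, vel = 0.0, 0.0
    transitions = []
    for _ in range(horizon):
        state = (pos, vel)
        # Linear policy with actions clipped to [-1, 1].
        action = max(-1.0, min(1.0, theta[0] * pos + theta[1] * vel + theta[2]))
        vel = 0.9 * vel + 0.1 * action
        pos = pos + vel
        reward = -0.1 * action ** 2  # placeholder reward, not from the paper
        transitions.append((state, action, reward, (pos, vel)))
    return transitions, pos

def gep(n_bootstrap=20, n_goal=80, sigma=0.05, seed=0):
    """Goal exploration sketch: returns the (theta, o) memory and a replay buffer."""
    rng = random.Random(seed)
    memory = []         # (theta, outcome) pairs
    replay_buffer = []  # transitions that would later be given to DDPG
    for episode in range(n_bootstrap + n_goal):
        if episode < n_bootstrap:
            # Warm-up (assumed): purely random policies seed the memory.
            theta = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        else:
            # Sample a goal in outcome space (range is an arbitrary assumption).
            goal = rng.uniform(-30.0, 30.0)
            # Nearest neighbour: the policy whose stored outcome is closest.
            nearest_theta, _ = min(memory, key=lambda m: abs(m[1] - goal))
            # Perturb its parameters with Gaussian noise.
            theta = [p + rng.gauss(0.0, sigma) for p in nearest_theta]
        transitions, outcome = rollout(theta)
        memory.append((theta, outcome))
        replay_buffer.extend(transitions)
    return memory, replay_buffer

memory, replay_buffer = gep()
```

Note that the exploration loop never looks at the reward: policies are selected only by their outcomes, which is what lets the later DDPG phase start from diverse data rather than from the trajectories its own exploration noise would have produced.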

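Experimental results

Research questions

RQ1: Does decoupling exploration from exploitation via GEP improve learning in continuous-action reinforcement learning compared with standard DDPG using exploration noise?
RQ2: How do GEP and GEP-PG perform on a low-dimensional deceptive-reward problem (Continuous Mountain Car) and on a higher-dimensional benchmark (Half-Cheetah)?
RQ3: How do policy complexity and the content of the replay buffer affect the performance and stability of GEP-PG?
RQ4: Does GEP-PG achieve higher sample efficiency and lower variance than the DDPG variants across the benchmarks?
RQ5: What future directions and extensions are possible for combining developmental exploration with deep RL?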

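Main findings

GEP alone provides competitive exploration and can outperform the DDPG variants on CMC, where the deceptive reward gradient misleads DDPG.
On Half-Cheetah, GEP-PG significantly outperforms the DDPG variants in both final performance and variance, reaching results close to the state of the art at the time.
DDPG with action perturbation can perform worse than DDPG with parameter perturbation in deceptive or sparse reward settings.
Filling the DDPG replay buffer with GEP-generated samples improves sample efficiency and final performance, and reduces variability, compared with training from scratch.
GEP-PG is robust to the number of GEP episodes used in the exploration phase: the performance gain holds across the range of values tested.
Larger and more diverse sets of trajectories in the buffer correlate positively with GEP-PG performance and generalization.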
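Findings of the form "significantly outperforms" rest on comparing final scores across random seeds. The summary mentions bootstrap tests without giving details, so the following is only a generic sketch of a percentile bootstrap confidence interval for a difference in means; it is not necessarily the exact test used in the paper. The function name, the resample count, and the two score lists are made-up placeholders, not results from the paper.

```python
import random

def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), one score per seed."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resampled_a) / len(a) - sum(resampled_b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Placeholder per-seed scores for two hypothetical algorithms.
scores_a = [10.0, 12.0, 11.0, 13.0, 12.5]
scores_b = [5.0, 6.0, 4.5, 7.0, 5.5]

lower, upper = bootstrap_mean_diff_ci(scores_a, scores_b)
# An interval that excludes 0 suggests the difference is unlikely to be
# an artifact of seed-to-seed variation alone.
significant = lower > 0 or upper < 0
```

With only a handful of seeds per algorithm such intervals are wide, which is one reason to report variance alongside mean performance.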


This review was generated by AI and checked by a human editor.