QUICK REVIEW

[Paper Review] Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization

Thomas Pierrot, Valentin Macé | arXiv (Cornell University) | Jun 15, 2020

Reinforcement Learning in Robotics | References: 37 | Cited by: 13

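One-Sentence Summary

The paper proposes QDPG, a novel Quality-Diversity algorithm that combines policy gradient methods with a new Diversity Policy Gradient (DPG) to discover diverse, high-performing neural policies in continuous control environments in a sample-efficient way. By using gradient-based mutations at the state and trajectory levels, QDPG is markedly more sample-efficient and more robust than evolutionary and policy gradient baselines on deceptive, sparse-reward control tasks.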

ABSTRACT

A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QDPG, which combines the strength of Policy Gradient algorithms and Quality Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QDPG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QDPG is significantly more sample-efficient than its evolutionary competitors.

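Research Motivation and Objectives

Address the exploration-exploitation trade-off in reinforcement learning by promoting diversity of solutions rather than performance alone.
Overcome the limitations of standard policy gradient methods in deceptive environments, where sparse or misleading rewards trap the learning process.
Improve the sample efficiency of Quality-Diversity (QD) optimization by replacing random mutations with gradient-based diversity search.
Produce diverse, high-performing policies in a single training run, giving robust multi-solution results for real robotics applications.
Show that combining the quality and diversity objectives through decoupled updates yields better performance and exploration than joint optimization.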

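Proposed Method

Introduces a Diversity Policy Gradient (DPG) that computes the gradient of behavioral diversity at the state and trajectory levels, in addition to the gradient of policy performance.
Integrates DPG into the MAP-Elites framework, using a behavior descriptor (BD) to map each policy to a cell in a grid of diverse behaviors.
Uses a replay buffer to reuse transitions multiple times, improving data efficiency over methods that use each sampled trajectory only once.
Applies two separate gradient updates, one for quality (a standard policy gradient) and one for diversity (DPG); decoupling the two avoids conflicting gradients.
Measures novelty at the level of individual states, which lets the algorithm exploit the more than 3,000 state transitions in each trajectory of tasks such as Ant-Maze.
Trains neural policies with off-policy reinforcement learning, so weight updates follow analytical gradients instead of random perturbations.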
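
The selection, mutation, and insertion loop described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a policy is just a 2-D parameter vector that also serves as its own behavior descriptor, replaces the critic-based policy gradients with analytic gradients of a made-up fitness and of a k-nearest-neighbor novelty score, and every name in it (`quality_step`, `diversity_step`, `try_insert`, `TARGET`, the grid bounds) is hypothetical. It only shows how a MAP-Elites grid can be driven by two decoupled gradient operators.

```python
import numpy as np

LOW, HIGH, BINS = -1.0, 1.0, 10      # hypothetical 2-D descriptor space and grid size
TARGET = np.array([0.8, 0.8])        # hypothetical optimum of the toy fitness

def fitness(theta):
    # Toy quality score: negative squared distance to TARGET.
    return -float(np.sum((theta - TARGET) ** 2))

def quality_step(theta, lr=0.1):
    # Gradient ascent on fitness; stands in for the quality policy-gradient update.
    return theta + lr * (-2.0 * (theta - TARGET))

def novelty(x, seen, k=3):
    # Mean distance from x to its k nearest neighbors among previously seen points.
    dists = np.sort(np.linalg.norm(seen - x, axis=1))[:k]
    return float(dists.mean())

def diversity_step(theta, seen, lr=0.1, k=3):
    # Gradient ascent on novelty: move away from the k nearest seen points.
    dists = np.linalg.norm(seen - theta, axis=1)
    nearest = seen[np.argsort(dists)[:k]]
    diffs = theta - nearest
    norms = np.linalg.norm(diffs, axis=1, keepdims=True) + 1e-8
    return theta + lr * (diffs / norms).mean(axis=0)

def cell(theta):
    # Map a descriptor to integer grid coordinates.
    idx = ((np.clip(theta, LOW, HIGH) - LOW) / (HIGH - LOW) * BINS).astype(int)
    return tuple(np.minimum(idx, BINS - 1))

def try_insert(grid, theta):
    # MAP-Elites rule: keep the candidate if its cell is empty or it beats the elite.
    key, f = cell(theta), fitness(theta)
    if key not in grid or f > grid[key][1]:
        grid[key] = (theta, f)

def run(iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    grid = {}
    for _ in range(5):
        try_insert(grid, rng.uniform(-0.1, 0.1, size=2))
    seen = np.array([theta for theta, _ in grid.values()])
    for it in range(iterations):
        keys = list(grid)
        parent, _ = grid[keys[rng.integers(len(keys))]]
        # Decoupled updates: each offspring receives one operator, never a mixed gradient.
        if it % 2 == 0:
            child = quality_step(parent)
        else:
            child = diversity_step(parent, seen)
        child = np.clip(child, LOW, HIGH)
        seen = np.vstack([seen, child])
        try_insert(grid, child)
    return grid

if __name__ == "__main__":
    archive = run()
    best = max(f for _, f in archive.values())
    print(len(archive), "cells filled; best fitness", best)
```

Alternating the two operators by iteration parity is a simplification chosen for brevity; how offspring are split between the quality and diversity operators in the paper is not stated in this summary.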

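Experimental Results

Research Questions

RQ1: Can a gradient-based diversity search mechanism outperform random mutations for Quality-Diversity optimization in high-dimensional control tasks?
RQ2: Does combining the quality and diversity objectives through decoupled policy gradients yield better sample efficiency and solution diversity than joint optimization?
RQ3: How does QDPG compare with state-of-the-art policy gradient methods and evolutionary methods on deceptive control problems with sparse rewards?
RQ4: To what extent does state-level novelty in the DPG component improve exploration efficiency and convergence in complex environments?
RQ5: Can the proposed method produce robust, diverse, and high-performing policies in a single training run, even where standard reinforcement learning methods fail?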

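Key Findings

QDPG is significantly more sample-efficient than traditional evolutionary QD methods, reducing the number of samples required by several orders of magnitude.
On deceptive tasks such as Ant-Trap and Ant-Maze, QDPG finds high-performing, diverse policies, whereas standard policy gradient methods such as TD3 and SAC converge to local optima.
QDPG outperforms PGA-ME in both final performance and data efficiency, indicating that gradient-based diversity search beats genetic mutation in high-dimensional spaces.
Ablations show that optimizing quality alone fails in deceptive environments because of reward traps, while optimizing diversity alone converges more slowly and reaches lower performance.
Decoupling the quality and diversity updates gives better learning dynamics and final results than joint optimization, which suffers from conflicting gradients.
On Ant-Maze, QDPG shows a wide interquartile range in performance, suggesting that learning can be unstable at times, possibly because of sensitivity to initialization or complex landscape dynamics.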
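
As a reminder of what the interquartile range in the last finding measures, the short sketch below computes it for two sets of final returns across seeds. The numbers are hypothetical and chosen only for illustration; they are not results from the paper, and `interquartile_range` is a helper name introduced here.

```python
import numpy as np

def interquartile_range(scores):
    # Spread of the middle 50% of runs: 75th percentile minus 25th percentile.
    q1, q3 = np.percentile(scores, [25, 75])
    return float(q3 - q1)

# Hypothetical final returns from five seeds (illustrative only, not from the paper).
stable_runs = [95.0, 96.0, 97.0, 98.0, 99.0]
unstable_runs = [20.0, 60.0, 97.0, 98.0, 99.0]

if __name__ == "__main__":
    print("stable IQR:", interquartile_range(stable_runs))
    print("unstable IQR:", interquartile_range(unstable_runs))
```

A wide range means the middle half of the runs ended far apart, which is why it is read as a sign of run-to-run instability rather than of low average performance.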

This review was generated by AI and checked by a human editor.