QUICK REVIEW

[Paper Review] Effective Diversity in Population Based Reinforcement Learning

Jack Parker-Holder, Aldo Pacchiano|arXiv (Cornell University)|Feb 3, 2020

Reinforcement Learning in Robotics | References: 64 | Cited by: 46

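One-Sentence Summary

DvD optimizes the behavioral diversity of an entire RL population using task-agnostic behavioral embeddings and a determinant-based diversity objective, with ES and TD3 implementations and an adaptive reward-diversity trade-off.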

ABSTRACT

Exploration is a key problem in reinforcement learning, since agents can only learn from data they acquire in the environment. With that in mind, maintaining a population of agents is an attractive method, as it allows data to be collected with a diverse set of behaviors. This behavioral diversity is often boosted via multi-objective loss functions. However, those approaches typically leverage mean field updates based on pairwise distances, which makes them susceptible to cycling behaviors and increased redundancy. In addition, explicitly boosting diversity often has a detrimental impact on optimizing already fruitful behaviors for rewards. As such, the reward-diversity trade-off typically relies on heuristics. Finally, such methods require behavioral representations, often handcrafted and domain specific. In this paper, we introduce an approach to optimize all members of a population simultaneously. Rather than using pairwise distance, we measure the volume of the entire population in a behavioral manifold, defined by task-agnostic behavioral embeddings. In addition, our algorithm, Diversity via Determinants (DvD), adapts the degree of diversity during training using online learning techniques. We introduce both evolutionary and gradient-based instantiations of DvD and show they effectively improve exploration without reducing performance when better exploration is not required.

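Motivation and Objectives

Drive exploration in RL by using a population of diverse agents to collect diverse experience.
Replace diversity measures based on pairwise distances with a determinant-based measure evaluated on behavioral embeddings.
Adapt lambda_t online via Thompson sampling to balance reward and diversity during training.
Provide two practical implementations (DvD-ES and DvD-TD3) that demonstrate improved exploration and performance.
Show that diversity-promoting updates do not harm performance when exploration is not needed.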
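
The online adaptation of lambda_t can be pictured as a small multi-armed bandit. The following is a minimal sketch, assuming a Bernoulli bandit with Beta posteriors, one arm per candidate value of lambda. The class name LambdaBandit, the candidate values (0.0, 0.5), and the binary "improved" feedback signal are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np


class LambdaBandit:
    """Thompson sampling over a small set of candidate diversity weights."""

    def __init__(self, candidates=(0.0, 0.5), seed=0):
        self.candidates = list(candidates)           # hypothetical arm values
        self.alpha = np.ones(len(self.candidates))   # 1 + observed successes
        self.beta = np.ones(len(self.candidates))    # 1 + observed failures
        self.rng = np.random.default_rng(seed)
        self.arm = 0                                 # index of the last arm played

    def select(self):
        # Draw one sample per arm from its Beta posterior and play the argmax.
        samples = self.rng.beta(self.alpha, self.beta)
        self.arm = int(np.argmax(samples))
        return self.candidates[self.arm]

    def update(self, improved):
        # Bernoulli feedback (assumed): True if the population's best
        # reward improved after training with the selected lambda.
        if improved:
            self.alpha[self.arm] += 1
        else:
            self.beta[self.arm] += 1


# Usage: pick lambda_t, train for one iteration, then report the outcome.
bandit = LambdaBandit()
lambda_t = bandit.select()
bandit.update(improved=True)
```

Arms that tend to be followed by an improvement accumulate a higher posterior mean and are selected more often, which is one way to avoid fixing the trade-off by hand.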

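Proposed Method

Define task-agnostic behavioral embeddings as a policy's actions on a set of states: phi(theta^i) = {pi_theta^i(·|s)}_{s in S}.
Measure population diversity as Div(Θ) = Det([K(phi(theta^i), phi(theta^j))]_{i,j}), the determinant of the kernel matrix over all pairs of policies, where K is a positive semi-definite kernel on embeddings.
Optimize the joint objective J(Θ) = sum_i E_{tau ~ pi_theta^i}[R(tau)] + lambda_t * Div(Θ), adapting lambda_t online via Thompson sampling.
Introduce two instantiations: DvD-ES (evolution strategies with a joint diversity term) and DvD-TD3 (off-policy TD3 with a differentiable diversity gradient).
Provide theoretical justification that maximizing the determinant recovers diverse, high-performing solutions (Theorem 3.3), and discuss a first-order relationship to mean pairwise distance for the SE (squared exponential) kernel.
Adaptively sample the states used to compute embeddings, and run ablations over kernel choice, state sampling, and the adaptation mechanism.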
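
The embedding and determinant steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the policies are deterministic toy functions, the probe states are random, and the helper names (behavioral_embedding, se_kernel_matrix, diversity, mean_pairwise_distance) and the length_scale parameter are hypothetical.

```python
import numpy as np


def behavioral_embedding(policy, states):
    """Concatenate a deterministic policy's actions on a fixed set of states."""
    return np.concatenate([np.atleast_1d(policy(s)) for s in states])


def se_kernel_matrix(embeddings, length_scale=1.0):
    """Squared exponential kernel: K_ij = exp(-||e_i - e_j||^2 / (2 l^2))."""
    E = np.asarray(embeddings, dtype=float)
    sq_dists = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))


def diversity(embeddings, length_scale=1.0):
    """Population diversity as the determinant of the kernel matrix."""
    return float(np.linalg.det(se_kernel_matrix(embeddings, length_scale)))


def mean_pairwise_distance(embeddings):
    """Baseline measure: average Euclidean distance over distinct pairs."""
    E = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    m = len(E)
    return float(dists.sum() / (m * (m - 1)))


def make_policy(W):
    # Toy deterministic policy (assumption): a tanh of a linear map.
    return lambda s: np.tanh(W @ s)


rng = np.random.default_rng(0)
states = rng.normal(size=(20, 4))  # hypothetical probe states
W_a, W_b, W_c = (rng.normal(size=(2, 4)) for _ in range(3))

distinct = [behavioral_embedding(make_policy(W), states) for W in (W_a, W_b, W_c)]
redundant = [behavioral_embedding(make_policy(W), states) for W in (W_a, W_a, W_c)]

print("det, distinct population: ", diversity(distinct))
print("det, redundant population:", diversity(redundant))
print("mean pairwise distance, redundant:", mean_pairwise_distance(redundant))
```

The redundant population contains the same policy twice, so its kernel matrix has two identical rows and the determinant is zero up to floating-point error, while its mean pairwise distance stays positive. That contrast illustrates why a volume-based measure penalizes duplicated behaviors that an averaged pairwise distance can miss.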

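Experimental Results

Research Questions

RQ1: Can determinant-based diversity promote exploration in a population of RL agents without causing harmful cycling or redundancy?
RQ2: Does maximizing population diversity via the determinant yield diverse, high-performing policies on multi-modal tasks?
RQ3: Can the diversity-reward trade-off be adapted effectively online to balance exploration and exploitation?
RQ4: Do DvD-ES and DvD-TD3 maintain performance when diversity is not needed?
RQ5: How sensitive is DvD to the choice of kernel and to embedding sampling?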

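Key Findings

DvD solves exploration tasks that vanilla ES and novelty-based ES cannot (for example, navigating to a goal enclosed by walls).
On multi-modal tasks (Cheetah, Ant), DvD learns diverse, high-performing behaviors across the different modes.
On uni-modal OpenAI Gym tasks, DvD loses less performance relative to vanilla ES and, by avoiding cycling, outperforms the novelty-driven NSR-ES.
On Humanoid-v2, DvD-TD3 achieves better sample efficiency and final performance within 1M timesteps (median best of about 6091 versus 5654 for E-TD3), surpassing prior methods.
Adaptive lambda_t yields clearer performance gains than fixed settings across environments.
Kernel sensitivity experiments show that most kernels perform close to SE, indicating robustness to kernel choice.
DvD-TD3 reaches a reward above 6000 on Humanoid-v2 (forward locomotion) within 1M steps, a practical gain for off-policy population methods.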

This review was generated by AI and reviewed by human editors.