QUICK REVIEW

[Paper Review] Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning

Qian Long, Zihan Zhou|arXiv (Cornell University)|Mar 23, 2020

Reinforcement Learning in Robotics | References: 52 | Cited by: 38

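One-Sentence Summary

This paper proposes Evolutionary Population Curriculum (EPC), a curriculum learning framework that scales multi-agent reinforcement learning by progressively increasing the number of agents and using evolutionary selection at each stage to keep the agent sets that adapt best.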

ABSTRACT

In multi-agent games, the complexity of the environment can grow exponentially as the number of agents increases, so it is particularly challenging to learn good policies when the agent population is large. In this paper, we introduce Evolutionary Population Curriculum (EPC), a curriculum learning paradigm that scales up Multi-Agent Reinforcement Learning (MARL) by progressively increasing the population of training agents in a stage-wise manner. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment issue throughout the curriculum: agents successfully trained in an early stage with a small population are not necessarily the best candidates for adapting to later stages with scaled populations. Concretely, EPC maintains multiple sets of agents in each stage, performs mix-and-match and fine-tuning over these sets and promotes the sets of agents with the best adaptability to the next stage. We implement EPC on a popular MARL algorithm, MADDPG, and empirically show that our approach consistently outperforms baselines by a large margin as the number of agents grows exponentially.

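Motivation and Objectives

Motivate the challenge of learning in environments whose complexity grows exponentially with the number of agents.
Propose a policy and value-function architecture that is invariant to the number of agents, so that it generalizes to a variable number of agents.
Introduce an evolutionary selection mechanism to resolve the objective misalignment between curriculum stages.
Demonstrate scalability and robustness by applying EPC to MADDPG on a diverse set of multi-agent tasks.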
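
To make the population-invariance goal concrete, the following is a minimal sketch of attention pooling in plain Python. It is not the paper's architecture: the function name `attention_pool` is hypothetical, the learned embedding and projection layers of a real network are omitted, and the query and entity vectors are random placeholders. It only illustrates why attention yields a fixed-size output for any number of observed agents.

```python
import math
import random


def attention_pool(query, entities):
    """Pool a variable-length list of entity vectors into one fixed-size vector.

    Hypothetical helper: scaled dot-product attention with softmax weights,
    without the learned projections a trained network would use.
    """
    d = len(query)
    # One attention score per entity: scaled dot product with the query.
    scores = [sum(q * e for q, e in zip(query, ent)) / math.sqrt(d) for ent in entities]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Weighted sum of entity vectors: the length is always d.
    return [sum(w * ent[i] for w, ent in zip(weights, entities)) for i in range(d)]


rng = random.Random(0)
query = [rng.uniform(-1, 1) for _ in range(4)]
three_agents = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
seven_agents = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]

# The pooled feature has the same size whether 3 or 7 agents are observed.
print(len(attention_pool(query, three_agents)), len(attention_pool(query, seven_agents)))
```

Because the output is a weighted sum, it also does not depend on the order in which the other agents are listed (up to floating-point rounding), which is what lets the same parameters be reused when the population grows.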

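Proposed Method

Use a self-attention-based Q-function and policy architecture that is invariant to the number of agents, so that it can handle arbitrary population sizes.
Split training into stages in which the number of agents increases progressively, forming a curriculum.
Maintain K parallel sets of agents for each role, and mix and match (crossover) across these sets to create the scaled-up population.
Use MARL fine-tuning as a guided mutation operator as the curriculum scales up the population.
Apply evolutionary selection: based on fitness in the scaled-up environment, promote the agent sets that adapt best to the next stage.
Demonstrate EPC on MADDPG and compare it against baselines in three environments.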
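
The stage loop described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: each "agent" is a single float, `fitness` is a made-up score whose optimum shifts with population size (a stand-in for objective misalignment), `fine_tune` is simple hill climbing standing in for MADDPG fine-tuning, only one role is modelled, and pairing a set with itself during mix-and-match is a choice of this sketch. All function and parameter names are hypothetical.

```python
import itertools
import math
import random


def fitness(agent_set):
    """Toy fitness (assumption): the best agent value shifts as the set grows."""
    target = math.log2(len(agent_set))
    return -sum((a - target) ** 2 for a in agent_set) / len(agent_set)


def fine_tune(agent_set, rng, steps=50):
    """Stand-in for MARL fine-tuning (guided mutation): greedy hill climbing."""
    best = list(agent_set)
    best_score = fitness(best)
    for _ in range(steps):
        trial = list(best)
        i = rng.randrange(len(trial))
        trial[i] += rng.gauss(0.0, 0.3)  # perturb one agent
        score = fitness(trial)
        if score > best_score:  # keep only improvements
            best, best_score = trial, score
    return best


def epc(num_sets=3, initial_size=2, num_stages=3, seed=0):
    """Run a toy population curriculum with K = num_sets parallel agent sets."""
    rng = random.Random(seed)
    # Stage 0: K independently trained sets with a small population.
    population = [
        fine_tune([rng.uniform(-1, 1) for _ in range(initial_size)], rng)
        for _ in range(num_sets)
    ]
    for _ in range(num_stages):
        # Mix-and-match (crossover): join pairs of sets, doubling the agent count.
        candidates = [
            list(a) + list(b)
            for a, b in itertools.combinations_with_replacement(population, 2)
        ]
        # Guided mutation: fine-tune every candidate at the larger scale.
        candidates = [fine_tune(c, rng) for c in candidates]
        # Selection: promote the K sets that adapted best.
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:num_sets]
    return population


final_sets = epc()
print([len(s) for s in final_sets])
```

With the defaults, each stage doubles the set size (2, 4, 8, 16), and selection is based on fitness measured after fine-tuning at the larger scale rather than on performance in the earlier, smaller stage.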

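Experimental Results

Research Questions

RQ1: How can the number of agents in MARL be scaled up without losing stability or performance?
RQ2: Does the evolutionary mix-and-match approach adapt better to larger populations than simple cloning?
RQ3: Can an attention-based, population-invariant architecture support scalable MARL training across arbitrary numbers of agents?
RQ4: As the number of agents grows exponentially, what gains does EPC deliver over a vanilla population curriculum and non-curriculum MARL baselines?

Key Findings

EPC consistently outperforms the baselines as the number of agents increases, including when the population grows exponentially.
The attention-based, population-invariant architecture improves on both vanilla MADDPG and mean-field baselines.
The vanilla population curriculum degrades as scale increases, whereas EPC maintains superior performance at every scale.
In Grassland, EPC improves survival rate and grass consumption; in Adversarial Battle and Food Collection, it achieves better cooperation and resource collection.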

This review was generated by AI and reviewed by human editors.