QUICK REVIEW

[Paper Digest] Deep Implicit Coordination Graphs for Multi-agent Reinforcement Learning

Sheng Li, Jayesh K. Gupta|arXiv (Cornell University)|Jun 19, 2020

Reinforcement Learning in Robotics | References: 50 | Cited by: 38

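One-sentence summary

DICG learns a dynamic implicit coordination graph with self-attention and graph convolutions, striking a balance between centralized and decentralized MARL and improving coordination on predator-prey, SMAC, and traffic junction tasks.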

ABSTRACT

Multi-agent reinforcement learning (MARL) requires coordination to efficiently solve certain tasks. Fully centralized control is often infeasible in such domains due to the size of joint action spaces. Coordination graph based formalization allows reasoning about the joint action based on the structure of interactions. However, they often require domain expertise in their design. This paper introduces the deep implicit coordination graph (DICG) architecture for such scenarios. DICG consists of a module for inferring the dynamic coordination graph structure which is then used by a graph neural network based module to learn to implicitly reason about the joint actions or values. DICG allows learning the tradeoff between full centralization and decentralization via standard actor-critic methods to significantly improve coordination for domains with large number of agents. We apply DICG to both centralized-training-centralized-execution and centralized-training-decentralized-execution regimes. We demonstrate that DICG solves the relative overgeneralization pathology in predatory-prey tasks as well as outperforms various MARL baselines on the challenging StarCraft II Multi-agent Challenge (SMAC) and traffic junction environments.

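Motivation and objectives

Motivate the need for better coordination in multi-agent reinforcement learning, where joint action spaces are large and static coordination graphs are insufficient.
Introduce DICG, which infers a dynamic coordination graph from observations and uses a graph neural network to compute joint action values or actions.
Support a tradeoff between centralized and decentralized execution through the CTCE (centralized-training-centralized-execution) and CTDE (centralized-training-decentralized-execution) regimes.
Show that DICG mitigates relative overgeneralization and outperforms baselines on complex multi-agent tasks such as SMAC and traffic junction.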

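Proposed method

A self-attention module learns an implicit soft coordination graph from the agent embeddings, with adjacency matrix M.
Graph convolutions over M pass messages and integrate information across agents.
Two variants are provided: DICG-CE for centralized-training-centralized-execution (CTCE), and DICG-DE for centralized-training-decentralized-execution (CTDE) with a centralized baseline.
The whole DICG module is trained end to end with a standard actor-critic method (PPO), using either joint actions or a centralized baseline for advantage estimation.
An encoder with parameters shared across agents maps each observation o_i to an embedding e_i, which is used to compute the attention weights μ_ij = softmax_j attention(e_i, e_j); these weights are the entries of M.
After m graph convolution layers, a residual connection with E(0) yields the final embeddings Ê.
DICG-CE uses Ê to generate the agents' actions; DICG-DE uses Ê to estimate the centralized critic baseline for advantage estimation.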
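
The attention and graph-convolution steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bilinear score e_i^T W_a e_j, the purely linear convolution M E W_c, and the names soft_adjacency and dicg_embed are hypothetical choices. The digest itself only specifies a softmax over attention scores, m convolution layers, and a residual connection with E(0).

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [v / total for v in exps]


def soft_adjacency(E, W_a):
    """Return M with M[i][j] = softmax_j(e_i^T W_a e_j).

    The bilinear score is an assumption made for this sketch.
    """
    E_transposed = [list(col) for col in zip(*E)]
    scores = matmul(matmul(E, W_a), E_transposed)
    return [softmax(row) for row in scores]


def dicg_embed(E0, W_a, conv_weights):
    """Apply one graph convolution per weight matrix, then add E0 back."""
    M = soft_adjacency(E0, W_a)
    E = E0
    for W_c in conv_weights:
        # Linear message passing over the soft graph (assumed: no nonlinearity).
        E = matmul(matmul(M, E), W_c)
    # Residual connection with the initial embeddings E(0).
    E_hat = [[a + b for a, b in zip(r0, r)] for r0, r in zip(E0, E)]
    return M, E_hat


# Three agents with 2-dimensional embeddings (toy values, not learned).
E0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
M, E_hat = dicg_embed(E0, identity, [identity, identity])  # m = 2 layers
```

Each row of M sums to 1, so before W_c is applied every agent's updated embedding is a convex mixture of all agents' embeddings. In a real model W_a and the W_c matrices would be learned jointly with the policy.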

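Experimental results

Research questions

RQ1: Can a dynamically learned implicit coordination graph improve multi-agent cooperation without domain-specific heuristics?
RQ2: Does combining an attention-based graph structure with a GCN mitigate relative overgeneralization in MARL tasks?
RQ3: How does DICG compare with fully centralized or fully decentralized baselines under the CTCE and CTDE regimes?
RQ4: Are DICG embeddings more informative than raw observations for predicting other agents' actions or values?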

Key findings

SMAC win rates by map:
Method	8m_vs_9m	3s_vs_5z	6h_vs_8z
DCG	55 ± 10%	85 ± 3%	10 ± 5%
VDN	49 ± 5%	72 ± 10%	0
QMIX	60 ± 11%	95 ± 1%	5 ± 5%
CENT-LSTM	42 ± 6%	0	0
DEC-LSTM	65 ± 16%	94 ± 5%	0
DICG-CE-LSTM	72 ± 11%	96 ± 3%	9 ± 9%
DICG-DE-LSTM	87 ± 6%	99 ± 1%	0

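DICG solves the relative overgeneralization problem in the predator-prey task, which fully centralized or fully decentralized methods struggle to solve.
On StarCraft II Multi-agent Challenge (SMAC) scenarios and the traffic junction task, DICG outperforms the baselines in win rate and sample efficiency.
The learned attention weights adapt to coordination needs (for example, attention to distant agents increases as the penalty grows).
Embeddings taken after DICG predict other agents' actions better than embeddings taken before it, which suggests that the implicit coordination reasoning works.
On SMAC, DICG-DE-LSTM attains the highest win rates, and among the most stable, on 8m_vs_9m (87 ± 6%) and 3s_vs_5z (99 ± 1%), surpassing the DCG, VDN, and QMIX baselines; on 6h_vs_8z it records 0, while DCG (10 ± 5%) and DICG-CE-LSTM (9 ± 9%) score highest.
On the traffic junction task, DICG-DE-MLP performs well in the medium and hard modes, surpassing several decentralized baselines and some centralized ones.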
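
The relative overgeneralization finding can be made concrete with a little arithmetic. The sketch below assumes a simplified two-predator capture game with hypothetical payoffs (+1 for a joint capture, a penalty for a solo attempt, 0 for holding back). These numbers and the function name expected_capture_value are illustrative assumptions and are not taken from the paper.

```python
def expected_capture_value(partner_capture_prob, reward, penalty):
    """Expected return of attempting a capture, given how often the partner joins."""
    return partner_capture_prob * reward - (1.0 - partner_capture_prob) * penalty


REWARD = 1.0      # hypothetical payoff for a joint capture
STAY_VALUE = 0.0  # hypothetical payoff for not attempting a capture

for penalty in (0.0, 0.5, 1.0, 1.5):
    # Value seen by an independent learner whose partner acts uniformly at random.
    random_partner = expected_capture_value(0.5, REWARD, penalty)
    # Value of the coordinated joint action, where the partner always joins.
    coordinated = expected_capture_value(1.0, REWARD, penalty)
    prefers_stay = random_partner < STAY_VALUE
    print(f"penalty={penalty}: random={random_partner}, "
          f"coordinated={coordinated}, prefers_stay={prefers_stay}")
```

Once the penalty exceeds the capture reward, the value averaged over a uniformly random partner turns negative (at penalty 1.5 it is -0.25), so an agent that learns independently prefers holding back even though the coordinated joint action still pays 1.0. Reasoning over the joint action, as a coordination graph does, avoids this trap.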

This digest was generated by AI and reviewed by human editors.