QUICK REVIEW

[Paper Review] Learning Attentional Communication for Multi-Agent Cooperation

Jiechuan Jiang, Zongqing Lu|arXiv (Cornell University)|May 20, 2018

Reinforcement Learning in Robotics | References: 26 | Cited by: 241

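One-Sentence Summary

ATOC learns when to communicate, and with whom, in large-scale multi-agent reinforcement learning, using an attention unit and a bidirectional LSTM communication channel to improve coordination and scalability.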

ABSTRACT

Communication could potentially be an effective way for multi-agent cooperation. However, information sharing among all agents or in predefined communication architectures that existing methods adopt can be problematic. When there is a large number of agents, agents cannot differentiate valuable information that helps cooperative decision making from globally shared information. Therefore, communication barely helps, and could even impair the learning of multi-agent cooperation. Predefined communication architectures, on the other hand, restrict communication among agents and thus restrain potential cooperation. To tackle these difficulties, in this paper, we propose an attentional communication model that learns when communication is needed and how to integrate shared information for cooperative decision making. Our model leads to efficient and effective communication for large-scale multi-agent cooperation. Empirically, we show the strength of our model in a variety of cooperative scenarios, where agents are able to develop more coordinated and sophisticated strategies than existing methods.

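Motivation and Goals

Enable efficient cooperation in large-scale multi-agent systems under limited bandwidth.
Develop a dynamic, attention-based mechanism that decides when agents should communicate.
Propose a bidirectional LSTM communication channel that selectively shares information within the groups that form.
Train end to end within an actor-critic framework so that policy and communication are learned jointly.
Demonstrate improvements over baselines in cooperative and competitive multi-agent scenarios.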

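Proposed Method

Introduce an attention unit that takes an agent's encoded observation and action intention and outputs the probability that communication is needed.
When communication is needed, the initiator selects a small set of nearby collaborators to form a communication group.
Use a bidirectional LSTM as the communication channel, which combines the thoughts of the agents in a group into an integrated thought for coordinated action.
Merge the integrated thought with the agent's own thought and feed the result into the policy network to generate an action.
Train with an extended DDPG, with policy and value networks shared across agents and the attention unit trained as a binary classifier guided by the Q-value difference (Delta Q).
Compare against baselines (CommNet, BiCNet, DDPG) in several scenarios: cooperative navigation, cooperative pushball, and predator-prey.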
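
The gating and grouping steps above, along with the Delta Q signal used to train the attention unit, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The 0.5 threshold, the Euclidean nearest-neighbor selection, the definition of Delta Q as mean Q with communication minus mean Q without it, and the min-max normalization of Delta Q into classifier targets are assumptions made for illustration. The bidirectional LSTM channel and the DDPG updates are left out.

```python
import math


def should_communicate(p_comm, threshold=0.5):
    """Gate: become an initiator when the attention unit's output
    probability exceeds the threshold (0.5 is an assumed value)."""
    return p_comm > threshold


def select_group(initiator, positions, group_size):
    """Form a group from the initiator and its nearest other agents.

    positions is a list of (x, y) tuples indexed by agent id.
    Nearest-by-Euclidean-distance is an assumed selection rule.
    """
    others = [j for j in range(len(positions)) if j != initiator]
    others.sort(key=lambda j: math.dist(positions[initiator], positions[j]))
    return [initiator] + others[:group_size]


def delta_q(q_with_comm, q_without_comm):
    """Assumed form of Delta Q: mean Q-value of the group's actions
    chosen with communication minus the mean without communication."""
    n = len(q_with_comm)
    return sum(q_with_comm) / n - sum(q_without_comm) / n


def normalize_labels(deltas):
    """Min-max normalize Delta Q values to [0, 1] so they can serve as
    targets for the attention unit's binary classifier (assumption)."""
    lo, hi = min(deltas), max(deltas)
    if hi == lo:
        return [0.5] * len(deltas)  # no spread: neutral target
    return [(d - lo) / (hi - lo) for d in deltas]


# Example: agent 0 decides to communicate and picks two collaborators.
positions = [(0, 0), (1, 0), (5, 5), (0, 2), (9, 9)]
if should_communicate(0.7):
    group = select_group(0, positions, group_size=2)
```

Here a large positive Delta Q means communication raised the group's estimated value, so the corresponding sample pushes the attention unit toward communicating in similar situations.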

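Experimental Results

Research Questions

RQ1: Can attentional communication improve coordination and scalability in large-scale MARL?
RQ2: Under bandwidth constraints, does dynamic, content-aware communication outperform fully connected and predefined architectures?
RQ3: How does attention-guided grouping affect learning efficiency and final performance under different reward structures (local, global, competitive)?

Key Findings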

Results for N = 50, L = 50:

Method	Mean reward	Collisions	% occupied landmarks
ATOC	-0.04	13	92%
ATOC w/o Comm	-0.22	47	40%
DDPG	-0.14	32	22%
CommNet	-0.60	59	12%
BiCNet	-0.52	51	16%

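In cooperative navigation, ATOC outperforms the baselines (CommNet, BiCNet, DDPG), achieving a higher mean reward and fewer collisions.
Communication helps: ATOC with communication outperforms ATOC without it.
Dynamic, attention-based communication reduces unnecessary information exchange and scales better to larger numbers of agents than fully connected baselines.
The bidirectional LSTM channel selectively retains and propagates information, producing more coordinated group strategies than simple averaging schemes.
Visualizations show that communication concentrates in dense or complex regions and declines as coordination stabilizes.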


This review was generated by AI and reviewed by a human editor.