QUICK REVIEW

[Paper Review] LEARNING TO SCHEDULE COMMUNICATION IN MULTI-AGENT REINFORCEMENT LEARNING

Daewoo Kim, Sangwoo Moon|arXiv (Cornell University)|Feb 5, 2019

Energy Harvesting in Wireless Networks|References 29|Cited by 59

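One-Sentence Summary

SchedNet trains a centralized critic and distributed actors to learn when and how agents should communicate under limited-bandwidth and shared-medium constraints, improving cooperative MARL performance over no-communication and simple-scheduling baselines.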

ABSTRACT

Many real-world reinforcement learning tasks require multiple agents to make sequential decisions under the agents' interaction, where well-coordinated actions among the agents are crucial to achieve the target goal better at these tasks. One way to accelerate the coordination effect is to enable multiple agents to communicate with each other in a distributed manner and behave as a group. In this paper, we study a practical scenario when (i) the communication bandwidth is limited and (ii) the agents share the communication medium so that only a restricted number of agents are able to simultaneously use the medium, as in the state-of-the-art wireless networking standards. This calls for a certain form of communication scheduling. In that regard, we propose a multi-agent deep reinforcement learning framework, called SchedNet, in which agents learn how to schedule themselves, how to encode the messages, and how to select actions based on received messages. SchedNet is capable of deciding which agents should be entitled to broadcasting their (encoded) messages, by learning the importance of each agent's partially observed information. We evaluate SchedNet against multiple baselines under two different applications, namely, cooperative communication and navigation, and predator-prey. Our experiments show a non-negligible performance gap between SchedNet and other mechanisms such as the ones without communication and with vanilla scheduling methods, e.g., round robin, ranging from 32% to 43%.

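Motivation and Objectives

Address how to coordinate multiple agents that, under partial observability, must communicate to achieve a common goal.
Handle practical constraints: limited bandwidth and a shared communication medium that requires MAC-style scheduling.
Learn which agents should broadcast, how to encode messages, and how to select actions based on received messages.
Leverage centralized training with distributed execution to improve cooperative performance.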

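Proposed Method

Proposes SchedNet, a deep MARL framework in which each agent has three components: a message encoder, an action selector, and a weight generator.
Introduces a weight-based scheduling algorithm (WSA) that, under limited bandwidth, selects the K agents allowed to broadcast.
Uses a centralized critic during training to estimate V(s) and Q(s,w), where w denotes the agents' scheduling weights, to guide the actor updates.
Trains the weight generator with DDPG so that it optimizes the scheduling weights given each agent's observation.
Implements two WSA variants, Top(k) and Softmax(k), both of which can be realized through a CSMA-like distributed mechanism.
Adopts an integrated architecture that jointly trains the encoder, action selector, and weight generator under a common critic.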
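
The two WSA variants named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `top_k_schedule` and `softmax_k_schedule` are hypothetical, the weights are assumed to be one scalar per agent, and drawing k distinct agents without replacement from the softmax distribution is an assumption about how Softmax(k) might be realized.

```python
import numpy as np

def top_k_schedule(weights, k):
    """Deterministic variant: schedule the k agents with the largest weights."""
    weights = np.asarray(weights, dtype=float)
    # Sort by descending weight; a stable sort breaks ties by agent index.
    chosen = np.argsort(-weights, kind="stable")[:k]
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[chosen] = True
    return mask

def softmax_k_schedule(weights, k, rng):
    """Probabilistic variant (assumed form): sample k distinct agents,
    with selection probabilities given by softmax(weights)."""
    weights = np.asarray(weights, dtype=float)
    shifted = weights - weights.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    chosen = rng.choice(weights.shape[0], size=k, replace=False, p=probs)
    mask = np.zeros(weights.shape[0], dtype=bool)
    mask[chosen] = True
    return mask

# Hypothetical weights for four agents, with bandwidth for K = 2 broadcasts.
example_weights = [0.1, 0.9, 0.4, 0.7]
print(top_k_schedule(example_weights, 2))  # agents 1 and 3 are scheduled
print(softmax_k_schedule(example_weights, 2, np.random.default_rng(0)))
```

Each function returns a boolean mask over agents; only the agents marked `True` would broadcast their encoded messages in that step. In this sketch the scheduler sees all weights at once, whereas the summary notes that the selection can be carried out by a CSMA-like distributed mechanism.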

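Experimental Results

Research Questions

RQ1: Can learned inter-agent communication scheduling improve cooperative MARL performance under bandwidth and MAC constraints?
RQ2: How should agents encode messages and allocate broadcast opportunities to maximize the collective reward?
RQ3: Can centralized training with distributed execution achieve effective coordination when communication is scheduled?
RQ4: How do the scheduling strategies (Top(k) vs. Softmax(k)) affect performance and the learned communication policies in MARL tasks?
RQ5: How large is the performance gain over no-communication baselines and simple scheduling schemes?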

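Key Findings

SchedNet outperforms the no-communication baselines (IDQN, COMA) as well as the baseline that uses simple scheduling (Round Robin).
In Predator-Prey, SchedNet with Top(1) improves performance by up to 43% over Round Robin scheduling.
In Cooperative Communication and Navigation, SchedNet clearly outperforms the baselines, with Top(1) slightly ahead of Softmax(1).
The learned scheduling weights prioritize agents with a larger observation horizon, indicating adaptive, importance-based scheduling.
When the observed state contains exploitable information (such as the prey's location), the scheduled agent's messages are more informative.
Deterministic Top(k) scheduling generally yields larger gains than probabilistic Softmax(k) scheduling.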


This review was generated by AI and reviewed by a human editor.