QUICK REVIEW

[Paper Review] Counterfactual Multi-Agent Policy Gradients

Jakob Foerster, Gregory Farquhar|arXiv (Cornell University)|May 24, 2017

Fuel Cells and Related Materials | Cited by 478

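One-Sentence Summary

COMA introduces a centralised critic paired with a per-agent counterfactual baseline, enabling effective credit assignment for decentralised policies in cooperative multi-agent reinforcement learning. The method is validated on decentralised StarCraft unit micromanagement tasks.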

ABSTRACT

Cooperative multi-agent systems can be naturally used to model many real world problems, such as network packet routing and the coordination of autonomous vehicles. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.

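Motivation and Objectives

Motivation: cooperative multi-agent reinforcement learning needs decentralised policies, and credit assignment must be solved when all agents share a single global reward.
Propose COMA: a multi-agent actor-critic method with a centralised critic and a counterfactual baseline.
Show how a specialised critic representation allows the counterfactual baseline to be computed efficiently in a single forward pass.
Evaluate COMA empirically on decentralised StarCraft unit micromanagement tasks with partial observability, and compare it against baselines.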

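Proposed Method

Use a centralised critic during training, conditioned on the joint action and state information.
Define a per-agent counterfactual baseline that marginalises out that agent's action while keeping the other agents' actions fixed, giving the advantage A^a(s,u) = Q(s,u) - sum_{u'^a} pi^a(u'^a|tau^a) Q(s,(u^{-a},u'^a)).
Compute the baseline efficiently with a critic that takes the other agents' actions as input and outputs a Q-value for each of the given agent's actions, so that all the Q-values needed for that agent's baseline come from a single forward pass.
Plug the advantage into the policy gradient framework: g = E_pi[ sum_a ∇_θ log pi^a(u^a|tau^a) A^a(s,u) ].
Train the critic on-policy with an adapted variant of TD(lambda), using a target network for the Q or V estimates.
Evaluate COMA empirically on StarCraft unit micromanagement under partial observability, and compare it with IAC baselines and centralised control.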
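The advantage formula and the TD(lambda) critic targets listed above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names `counterfactual_advantage` and `lambda_returns`, the argument layout, and the default `gamma` and `lam` values are assumptions made here, and `lambda_returns` uses the standard forward-view lambda-return recursion, which may differ in detail from the variant used in the paper.

```python
import numpy as np


def counterfactual_advantage(q_values, policy_probs, taken_action):
    """Counterfactual advantage for one agent at one time step.

    q_values:     shape (n_actions,), Q(s, (u^{-a}, u'^a)) for every candidate
                  action u'^a of this agent, other agents' actions held fixed.
    policy_probs: shape (n_actions,), pi^a(u'^a | tau^a).
    taken_action: index of the action u^a the agent actually took.
    """
    q_values = np.asarray(q_values, dtype=float)
    policy_probs = np.asarray(policy_probs, dtype=float)
    # Baseline: expected Q under the agent's own policy (marginalises u^a).
    baseline = float(np.dot(policy_probs, q_values))
    return float(q_values[taken_action]) - baseline


def lambda_returns(rewards, next_values, gamma=0.99, lam=0.8):
    """Standard forward-view lambda-returns, computed backwards in time.

    rewards[t] is r_t; next_values[t] is an estimate (e.g. from a target
    network) of the value of the state reached after step t, with 0.0 used
    for a terminal state. gamma and lam defaults are illustrative only.
    """
    targets = np.zeros(len(rewards))
    next_return = next_values[-1]  # bootstrap from the final estimate
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        targets[t] = rewards[t] + gamma * blended
        next_return = targets[t]
    return targets


if __name__ == "__main__":
    q = [1.0, 2.0, 4.0]          # one critic output per candidate action
    pi = [0.5, 0.25, 0.25]       # the agent's current policy
    print(counterfactual_advantage(q, pi, taken_action=2))
    print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

Because the baseline is the policy-weighted mean of the same Q-values, the advantages average to zero under the agent's own policy, which is why subtracting it changes the variance of the gradient estimate but not its expectation.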

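Experimental Results

Research Questions

RQ1: Under a shared global reward, does a centralised critic with a counterfactual baseline improve credit assignment for decentralised agents?
RQ2: Can COMA outperform standard multi-agent actor-critic baselines (IAC variants) on partially observable StarCraft unit micromanagement tasks while remaining competitive with centralised controllers?
RQ3: Is the proposed critic representation for efficient evaluation of the counterfactual baseline effective in practice?
RQ4: With a limited field of view, how does COMA perform across different team sizes and map difficulties?

Key Findings

COMA achieves a higher win rate than the IAC baselines in all StarCraft scenarios.
COMA outperforms the CENTRAL-QV baseline, which indicates the importance of the counterfactual baseline.
Thanks to its shaped training signal, COMA learns faster and more stably than the centralised V baseline.
The best COMA agents are competitive with state-of-the-art centralised controllers, even though those controllers have access to the full state and to macro-actions.
Ablation studies show that the combination of a centralised critic and the counterfactual baseline is critical for both final performance and learning efficiency.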


This review was generated by AI and checked by a human editor.