QUICK REVIEW

[Paper Review] Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Christian Schroeder de Witt, Tarun Gupta | arXiv (Cornell University) | Nov 18, 2020

Reinforcement Learning in Robotics | References: 33 | Cited by: 182

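One-sentence summary

This paper shows that Independent PPO (IPPO), an independent learning method, can match or exceed state-of-the-art centralized training with decentralized execution methods on SMAC with little hyperparameter tuning. It also analyzes the roles of policy clipping and central state information, and suggests that relative overgeneralisation may be less of a problem on SMAC than theory predicts.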

ABSTRACT

Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on the popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.

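Motivation and objectives

Re-evaluate the viability of independent learning in cooperative multi-agent reinforcement learning tasks such as SMAC.
Evaluate IPPO against centralized value function methods such as QMIX, MAVEN, and MAPPO on hard SMAC maps.
Investigate why IPPO performs well, focusing on PPO clipping and the use of central state information during training.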

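Proposed method

Propose Independent PPO (IPPO), in which each agent learns a clipped local policy from its own independent objective.
Apply generalized advantage estimation (GAE) to each agent, using a local critic V_phi(z_t^a) whose parameters are shared across agents.
Define each agent's advantage A_t^a as in Eq. (4), from temporal-difference (TD) errors and multi-step GAE.
Apply policy clipping in the PPO objective and, optionally, value clipping as in Eq. (6) to limit critic updates.
Train with network parameters shared among the critics and among the actors, in the centralized training with decentralized execution setting.
Compare against QMIX, IQL, MAPPO, and MAVEN on 16 SMAC maps, with mild map-specific hyperparameter tuning.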
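
The per-agent quantities in the method list above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the default values for gamma, lam, and eps, and the toy inputs are all assumptions. The value loss follows the form commonly used in PPO codebases (the larger of the clipped and unclipped squared errors), which may differ in detail from the paper's Eq. (6).

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Multi-step GAE for a single agent (hypothetical helper).

    rewards: shape (T,), the shared team reward at each step.
    values:  shape (T + 1,), local critic estimates V(z_t^a),
             including a bootstrap value for the final observation.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error under this agent's local value estimate.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_policy_objective(ratio, advantages, eps=0.2):
    """PPO clipped surrogate (to be maximized) for one agent.

    ratio: pi_new(u_t | z_t) / pi_old(u_t | z_t) for each sample.
    """
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()


def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Value loss with clipping (common PPO form; an assumption here)."""
    v_new = np.asarray(v_new, dtype=float)
    v_old = np.asarray(v_old, dtype=float)
    returns = np.asarray(returns, dtype=float)
    # Keep the new estimate within eps of the old one before scoring it.
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    return np.maximum((v_new - returns) ** 2, (v_clipped - returns) ** 2).mean()


if __name__ == "__main__":
    # Toy data: two agents see the same team reward but hold their own
    # local value estimates, so their advantages differ.
    team_rewards = [1.0, 0.0, 2.0]
    local_values = {
        "agent_0": [0.5, 0.4, 1.0, 0.0],
        "agent_1": [0.2, 0.1, 0.9, 0.0],
    }
    for name, values in local_values.items():
        print(name, gae_advantages(team_rewards, values))
```

Because every agent is scored only against its own local critic, nothing in this sketch needs the joint state or the other agents' actions, which is what makes the learning independent.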

Experimental results

Research questions

RQ1: Does IPPO match or exceed state-of-the-art CTDE MARL methods on SMAC across varied maps?
RQ2: How do PPO-specific components like policy clipping and value clipping influence performance in independent learning for cooperative MARL?
RQ3: What is the impact of conditioning critics on full state information during centralized training for IPPO?
RQ4: Is relative overgeneralisation a practical obstacle for IPPO on SMAC maps?
RQ5: How do IPPO results compare to independent baselines (IAC, IQL) and centralized baselines (QMIX, MAPPO, MAVEN) on hard SMAC maps?

Key findings

IPPO significantly outperforms MAPPO and QMIX on several hard SMAC maps.
IPPO beats IQL and IAC and shows greater stability across many maps.
Policy clipping is essential to IPPO’s performance, while value clipping improves some maps.
Using full central state information for the critic can be worse than local critics on hard maps, indicating central state information is not universally beneficial in SMAC.
Reducing the effective learning rate via IPPO’s clipping cannot be replicated by simply lowering the learning rate in IAC, suggesting clipping provides a unique stabilizing benefit.
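
One way to see why clipping is not equivalent to a smaller learning rate is to look at the gradient of the clipped surrogate with respect to the probability ratio. The sketch below is an illustration of the standard PPO objective under assumed toy inputs, not a reproduction of the paper's ablation, and the function name is hypothetical. Clipping removes the gradient only for samples whose ratio has already moved past the clip boundary in the direction the advantage favours, whereas a lower learning rate shrinks every sample's contribution by the same factor.

```python
import numpy as np


def surrogate_grad_wrt_ratio(ratio, advantages, eps=0.2):
    """Gradient of min(r * A, clip(r, 1 - eps, 1 + eps) * A) w.r.t. r."""
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # The clipped (flat) branch is the active one when the ratio has
    # already overshot in the direction that improves the objective.
    saturated = ((advantages > 0) & (ratio > 1.0 + eps)) | (
        (advantages < 0) & (ratio < 1.0 - eps)
    )
    return np.where(saturated, 0.0, advantages)


if __name__ == "__main__":
    ratios = np.array([1.0, 1.5, 0.5, 1.5, 0.5])
    advs = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

    # Clipping: a per-sample mask that zeroes only saturated samples.
    print("clipped surrogate:", surrogate_grad_wrt_ratio(ratios, advs))

    # Lower learning rate: a uniform rescaling that keeps pushing
    # saturated samples further from the old policy.
    print("unclipped, lr x 0.1:", 0.1 * advs)
```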

This review was generated by AI and reviewed by human editors.