QUICK REVIEW

[Paper Review] Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Christian Schroeder de Witt, Tarun Gupta|arXiv (Cornell University)|Nov 18, 2020
Reinforcement Learning in Robotics|References 33|Cited by 182
One-Sentence Summary

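This paper shows that Independent PPO (IPPO), an independent learning method, can match or exceed state-of-the-art centralized training with decentralized execution (CTDE) methods on SMAC with limited hyperparameter tuning. It also analyzes the roles of policy clipping and central state information, and its results suggest that relative overgeneralization may be less of a problem on SMAC than theory predicts.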

ABSTRACT

Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.

Motivation and Objectives

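  • Motivate a re-evaluation of how viable independent learning is in cooperative multi-agent reinforcement learning tasks such as SMAC.
  • Evaluate IPPO against centralized value function methods such as QMIX, MAVEN, and MAPPO on hard SMAC maps.
  • Investigate why IPPO performs well, focusing on PPO clipping and the use of central state information during training.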

Proposed Method

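  • Proposes Independent PPO (IPPO), in which each agent learns a clipped local policy from its own independent objective.
  • Applies generalized advantage estimation (GAE) to each agent, using a local critic V_phi(z_t^a) that is shared across agents.
  • Defines each agent's advantage A_t^a, per Eq. (4), from temporal-difference (TD) errors and multi-step GAE.
  • Applies policy clipping in the PPO objective and, optionally, value clipping in Eq. (6) to limit critic updates.
  • Trains in the centralized training with decentralized execution setting, with network parameters shared among the critics and likewise among the actors.
  • Compares against QMIX, IQL, MAPPO, and MAVEN on 16 SMAC maps, with mild map-specific hyperparameter tuning.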
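
The per-agent quantities listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the default values of `gamma`, `lam`, and `eps`, and the assumption of a single trajectory without mid-trajectory termination are all hypothetical. The value loss uses the common PPO-style formulation (elementwise maximum of the clipped and unclipped squared errors), which may differ in detail from the paper's Eq. (6).

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE for one agent's trajectory (assumes no termination inside it).

    rewards: shape (T,); values: shape (T,) local critic outputs V(z_t^a);
    last_value: bootstrap value for the step after the trajectory ends.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values  # one-step TD errors
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # Exponentially weighted sum of future TD errors.
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def clipped_policy_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate, negated so that lower is better."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Value loss where the critic may move at most eps from its old estimate."""
    v_new, v_old, returns = map(np.asarray, (v_new, v_old, returns))
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    return np.mean(np.maximum((v_new - returns) ** 2, (v_clipped - returns) ** 2))
```

In an independent-learning setup, each agent would run these functions on its own local trajectory; with parameter sharing, the resulting losses would be averaged over agents before a single gradient step on the shared actor and critic.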

Experimental Results

Research Questions

  • RQ1: Does IPPO match or exceed state-of-the-art CTDE MARL methods on SMAC across varied maps?
  • RQ2: How do PPO-specific components like policy clipping and value clipping influence performance in independent learning for cooperative MARL?
  • RQ3: What is the impact of conditioning critics on full state information during centralized training for IPPO?
  • RQ4: Is relative overgeneralisation a practical obstacle for IPPO on SMAC maps?
  • RQ5: How do IPPO results compare to independent baselines (IAC, IQL) and centralized baselines (QMIX, MAPPO, MAVEN) on hard SMAC maps?

Key Findings

  • IPPO significantly outperforms MAPPO and QMIX on several hard SMAC maps.
  • IPPO beats IQL and IAC and shows greater stability across many maps.
  • Policy clipping is essential to IPPO’s performance, while value clipping improves some maps.
  • Using full central state information for the critic can be worse than local critics on hard maps, indicating central state information is not universally beneficial in SMAC.
  • The benefit of IPPO’s clipping cannot be replicated simply by lowering the learning rate in IAC, suggesting that clipping provides a distinct stabilizing effect.
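
One way to see why clipping differs from a smaller learning rate is to look at the gradient of the clipped surrogate with respect to the probability ratio. The sketch below is an illustration of the standard PPO objective only, not an analysis taken from the paper; the function names and the finite-difference step `h` are hypothetical choices.

```python
import numpy as np

def surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def surrogate_grad(ratio, adv, eps=0.2, h=1e-6):
    """Central finite-difference gradient of the surrogate w.r.t. the ratio."""
    return (surrogate(ratio + h, adv, eps) - surrogate(ratio - h, adv, eps)) / (2 * h)

# Inside the clipping range the gradient is just the advantage.
g_inside = surrogate_grad(1.0, 1.0)
# Once the ratio has already moved past 1 + eps in the direction the
# advantage favors, the sample stops contributing any gradient.
g_outside = surrogate_grad(1.5, 1.0)
# A ratio that moved the "wrong" way is not clipped, so it can be pulled back.
g_wrong_way = surrogate_grad(0.5, 1.0)
```

A lower learning rate scales every sample's update by the same factor, whereas clipping removes the gradient only for samples whose policy has already changed by more than `eps` in the favored direction, so the two are not interchangeable.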


This review was generated by AI and reviewed by a human editor.