QUICK REVIEW

[Paper Review] Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

Ryan Lowe|arXiv (Cornell University)|Jun 7, 2017

Reinforcement Learning in Robotics | References: 38 | Cited by: 1,014

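One-Sentence Summary

Introduces MADDPG, an actor-critic method whose centralized critics condition on the other agents' actions during training. This improves learning while keeping execution decentralized, and yields better performance on cooperative, competitive, and mixed multi-agent tasks. It also uses policy ensembles and online modeling of other agents to improve robustness.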

ABSTRACT

We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.

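Motivation and Objectives

Motivate and analyze the challenges of applying traditional reinforcement learning to multi-agent settings: Q-learning is hampered by the non-stationarity of the environment, and policy gradient by high gradient variance.
Propose a general-purpose multi-agent deep reinforcement learning algorithm that combines centralized training with decentralized execution.
Enable learning under local execution while giving the centralized critics access to the other agents' policies during training.
Improve stability and robustness through online modeling of other agents and through policy ensembles.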
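The split between centralized training and decentralized execution described in the objectives above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the linear `actor` and `centralized_critic` functions, the random weights, and the dimensions are hypothetical stand-ins for the neural networks a real implementation would use. The sketch shows only which inputs each component is allowed to see.

```python
import random

def actor(weights, own_obs):
    """Decentralized actor: maps ONLY the agent's own observation to an action."""
    # Hypothetical linear policy producing one scalar action.
    return sum(w * o for w, o in zip(weights, own_obs))

def centralized_critic(weights, all_obs, all_actions):
    """Centralized critic: scores the joint observations and joint actions."""
    # Flatten every agent's observation, then append every agent's action.
    joint = [o for obs in all_obs for o in obs] + list(all_actions)
    return sum(w * x for w, x in zip(weights, joint))

rng = random.Random(0)
n_agents, obs_dim = 3, 2

# One actor and one critic per agent (illustrative random linear weights).
actor_weights = [[rng.uniform(-1, 1) for _ in range(obs_dim)]
                 for _ in range(n_agents)]
critic_input_dim = n_agents * obs_dim + n_agents
critic_weights = [[rng.uniform(-1, 1) for _ in range(critic_input_dim)]
                  for _ in range(n_agents)]

observations = [[rng.uniform(-1, 1) for _ in range(obs_dim)]
                for _ in range(n_agents)]

# Execution: each agent acts from its local observation only.
actions = [actor(actor_weights[i], observations[i]) for i in range(n_agents)]

# Training: each agent's critic sees everyone's observations and actions.
q_values = [centralized_critic(critic_weights[i], observations, actions)
            for i in range(n_agents)]
```

Because the critics are needed only to compute training signals, they can be discarded at deployment time, leaving each agent with a policy that depends on local information alone.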

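Proposed Method

Extend actor-critic policy gradient with a centralized critic conditioned on the actions of all agents.
Derive the gradient used to update agent i from a centralized Q^{pi}_i that takes all agents' actions and some state information as input.
Allow decentralized execution, in which each agent uses only its local observations.
Optionally learn approximations of other agents' policies, reducing the need for exact knowledge of those policies.
Introduce policy ensembles that train multiple sub-policies per agent to improve robustness.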
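The centralized-critic gradient mentioned above can be written out as follows. This is a reconstruction in the notation commonly used for MADDPG rather than a quotation from this review: the symbols x (the state information given to the critic), o_i (agent i's local observation), D (a replay buffer), and the primed target policies are notation assumed here.

```latex
% Stochastic policies: gradient for agent i with a centralized critic
\nabla_{\theta_i} J(\theta_i)
  = \mathbb{E}_{s \sim p^{\mu},\, a_i \sim \pi_i}
    \left[ \nabla_{\theta_i} \log \pi_i(a_i \mid o_i)\,
           Q_i^{\pi}(x, a_1, \dots, a_N) \right]

% Deterministic policies \mu_i, with samples drawn from a replay buffer D
\nabla_{\theta_i} J(\mu_i)
  = \mathbb{E}_{x, a \sim \mathcal{D}}
    \left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i)\,
           \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_N)
           \Big|_{a_i = \mu_i(o_i)} \right]

% Critic regression target, using target policies \mu'_j
\mathcal{L}(\theta_i)
  = \mathbb{E}_{x, a, r, x'}
    \left[ \left( Q_i^{\mu}(x, a_1, \dots, a_N) - y \right)^2 \right],
\qquad
y = r_i + \gamma\, Q_i^{\mu'}(x', a'_1, \dots, a'_N)
    \Big|_{a'_j = \mu'_j(o_j)}
```

Only the critic Q_i takes the joint action (a_1, ..., a_N) as input. The policy of agent i depends on o_i alone, which is what permits decentralized execution.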

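Experimental Results

Research Questions

RQ1: Can a centralized critic that uses other agents' actions stabilize learning in multi-agent environments with local execution?
RQ2: Does modeling or approximating other agents' policies during training improve performance when the exact policies are unknown?
RQ3: Do sub-policy ensembles yield multi-agent policies that are more robust to non-stationarity and adversarial behavior?
RQ4: How does MADDPG compare with single-agent DDPG on cooperative and competitive tasks?
RQ5: What are the advantages and limitations of extending policy gradient to the multi-agent setting with a centralized critic?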

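Key Findings

MADDPG outperforms DDPG and other baselines in both cooperative and competitive environments.
Agents trained with MADDPG learn coordinated behavior that single-agent methods struggle to achieve.
Using approximations of other agents' policies achieves comparable performance without slowing convergence.
In competitive environments, policy ensembles yield stronger, more robust policies than single-policy agents.
On cooperative and deception tasks across several scenarios, MADDPG achieves higher success rates and lower adversary success rates.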
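The policy-ensemble finding above refers to training several sub-policies per agent. A minimal sketch of the bookkeeping follows. It assumes that one sub-policy is drawn uniformly at random for each episode and that each sub-policy keeps its own replay buffer. The `EnsembleAgent` class, its fixed-bias "policies", and all parameter values are hypothetical placeholders, and no learning step is shown.

```python
import random
from collections import deque

class EnsembleAgent:
    """Hypothetical agent holding K sub-policies, each with its own replay buffer."""

    def __init__(self, num_sub_policies, buffer_size=1000, seed=0):
        self.rng = random.Random(seed)
        # Placeholder sub-policies: each is just a fixed action bias here.
        self.sub_policies = [self.rng.uniform(-1, 1)
                             for _ in range(num_sub_policies)]
        self.buffers = [deque(maxlen=buffer_size)
                        for _ in range(num_sub_policies)]
        self.active = 0

    def start_episode(self):
        # Pick one sub-policy uniformly at random for the whole episode.
        self.active = self.rng.randrange(len(self.sub_policies))
        return self.active

    def act(self, observation):
        # Stand-in for a learned policy: observation plus the active bias.
        return observation + self.sub_policies[self.active]

    def store(self, transition):
        # Experience goes only into the active sub-policy's buffer.
        self.buffers[self.active].append(transition)

agent = EnsembleAgent(num_sub_policies=3)
for episode in range(30):
    agent.start_episode()
    for step in range(5):
        obs = float(step)
        action = agent.act(obs)
        agent.store((obs, action))

buffer_sizes = [len(b) for b in agent.buffers]
```

Since opponents face a different sub-policy from one episode to the next, no agent can overfit to a single fixed behavior of its competitors, which is the intuition behind the robustness gain reported above.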

This review was generated by AI and reviewed by a human editor.