QUICK REVIEW

[Paper Review] RODE: Learning Roles to Decompose Multi-Agent Tasks

Tonghan Wang, Tarun Gupta|arXiv (Cornell University)|Oct 4, 2020

Reinforcement Learning in Robotics | References: 98 | Cited by: 60

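One-Sentence Summary

RODE automatically discovers roles by clustering actions according to their effects, yielding a bi-level learning framework that decomposes multi-agent tasks into smaller, transferable subtasks.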

ABSTRACT

Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy -- the role selector searches in a smaller role space and at a lower temporal resolution, while role policies learn in significantly reduced primitive action-observation spaces. We further integrate information about action effects into the role policies to boost learning efficiency and policy generalization. By virtue of these advances, our method (1) outperforms the current state-of-the-art MARL algorithms on 10 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents. Demonstrative videos are available at https://sites.google.com/view/rode-marl .
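The bi-level hierarchy described in the abstract can be illustrated with a minimal single-agent sketch. It assumes the role selector is re-queried at a fixed interval (the hypothetical `role_interval` parameter), which is one simple way to realize "a lower temporal resolution". The role names, action names, integer observation, and all functions below are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical toy setup: role names, action names, and the integer
# "observation" are illustrative assumptions, not taken from the paper.
ROLE_ACTIONS = {
    "scout": ["move_north", "move_south"],
    "striker": ["attack_enemy_0", "attack_enemy_1"],
}


def select_role(obs):
    """High-level selector: searches over the small set of roles."""
    return "striker" if obs >= 2 else "scout"


def select_action(obs, role):
    """Low-level role policy: picks only from its role's restricted actions."""
    allowed = ROLE_ACTIONS[role]
    return allowed[obs % len(allowed)]


def env_step(obs, action):
    """Stand-in environment: the observation is just a step counter."""
    return obs + 1


def run_episode(horizon, role_interval, initial_obs=0):
    """Roles are chosen every `role_interval` steps; actions every step."""
    obs, role, trace = initial_obs, None, []
    for t in range(horizon):
        if t % role_interval == 0:  # assumed fixed re-selection interval
            role = select_role(obs)
        action = select_action(obs, role)
        trace.append((t, role, action))
        obs = env_step(obs, action)
    return trace


trace = run_episode(horizon=6, role_interval=3)
for step in trace:
    print(step)
```

In this toy run the selector would already prefer "striker" at step 2, but the role only changes at step 3, when the selector is next consulted. Every action in the trace comes from the current role's restricted set.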

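Motivation and Objectives

Promote scalable multi-agent learning through role-based decomposition.
Automatically discover an effective set of roles without hand-crafted design.
Reduce learning complexity by factorizing the joint action space according to action effects.
Enable rapid transfer of learned policies to environments with different numbers of agents or actions.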

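Proposed Method

Learn action representations that encode the effects of actions on observations and rewards, using a forward prediction model.
Cluster actions in the representation space to form restricted role action spaces.
Introduce a bi-level hierarchy: a role selector at the top level, and role policies that operate in restricted action spaces.
Compute each role representation as the average of the representations of its actions, to inform role selection.
Use QMIX-style mixing networks to optimize the joint return while learning both the role policies and the role selector.
Train end to end with temporal-difference losses for the role selector and the role policies, using the global reward.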
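The clustering and role-representation steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the action representations are hand-written 2-D vectors rather than outputs of a learned forward model, clustering uses a basic k-means with farthest-point initialization, and scoring roles by a dot product with a hypothetical agent embedding is an assumption of this sketch. All function names are illustrative.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Basic k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels


def build_roles(action_reprs, k):
    """Each cluster becomes a role: a restricted action set plus the
    mean of its actions' representations as the role representation."""
    labels = kmeans(action_reprs, k)
    roles = []
    for c in range(k):
        actions = [a for a, lab in enumerate(labels) if lab == c]
        if actions:
            reprs = [action_reprs[a] for a in actions]
            roles.append({"actions": actions, "repr": mean_vector(reprs)})
    return roles


def select_role(agent_embedding, roles):
    """Assumed scoring rule: dot product of agent and role representations."""
    return max(range(len(roles)), key=lambda r: dot(agent_embedding, roles[r]["repr"]))


def greedy_action(q_values, role):
    """Greedy choice restricted to the role's action subset."""
    return max(role["actions"], key=lambda a: q_values[a])


# Toy data: actions 0-1 behave alike, actions 2-3 behave alike.
action_reprs = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
roles = build_roles(action_reprs, k=2)
print([role["actions"] for role in roles])
```

Note that `greedy_action` ignores any action outside the selected role, even if that action has the highest Q-value overall. This is what shrinks the search space each role policy has to learn over.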

Experimental Results

Research Questions

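RQ1: Do action-effect-based representations cluster actions effectively enough to support role-based decomposition?
RQ2: In environments with many agents, does restricting role action spaces improve learning efficiency and policy performance?
RQ3: Can RODE transfer learned policies to problems with different numbers of agents or actions?
RQ4: How much does each component (action representations, restricted action spaces, hierarchical learning) contribute to overall performance?
RQ5: How does RODE perform on challenging multi-agent benchmarks such as StarCraft II micromanagement?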

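Key Findings

RODE achieves state-of-the-art performance on 10 of the 14 StarCraft II micromanagement maps (including all 9 hard and super hard maps).
RODE transfers rapidly to environments with three times as many agents as the training setting.
Action representations effectively reveal clusters of functionally similar actions (e.g., moving toward or away from enemies, attacking similar unit types).
Ablation studies show that restricting role action spaces and exploiting action-effect information are both essential to the gains over baselines; using the full action space or random restrictions does not yield comparable benefits.
Combined with effect-based action factorization, RODE's hierarchical design (a role selector plus role policies) provides a scalable learning framework.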

This review was generated by AI and reviewed by human editors.