QUICK REVIEW

[Paper Review] UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

Siyi Hu, Fengda Zhu|arXiv (Cornell University)|Jan 20, 2021

Reinforcement Learning in Robotics | References: 35 | Cited by: 32

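One-Sentence Summary

UPDeT introduces a universal, transformer-based policy decoupling framework for multi-agent reinforcement learning that handles variable input/output dimensions, transfers quickly across tasks, and outperforms RNN-based methods.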

ABSTRACT

Recent advances in multi-agent reinforcement learning have been largely limited in training one model from scratch for every new task. The limitation is due to the restricted model architecture related to fixed input and output dimensions. This hinders the experience accumulation and transfer of the learned agent over tasks with diverse levels of difficulty (e.g. 3 vs 3 or 5 vs 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks with the requirement of different observation and action configurations. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation with an importance weight measured by the merits of the self-attention mechanism. Compared to a standard transformer block, the proposed model, named as Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable. UPDeT is general enough to be plugged into any multi-agent reinforcement learning pipeline and equip them with strong generalization abilities that enables the handling of multiple tasks at a time. Extensive experiments on large-scale SMAC multi-agent competitive games demonstrate that the proposed UPDeT-based multi-agent reinforcement learning achieves significant results relative to state-of-the-art approaches, demonstrating advantageous transfer capability in terms of both performance and training speed (10 times faster).

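Motivation and Objectives

Motivate the need for a universal MARL architecture that works across different observation and action configurations.
Propose a transformer-based individual function that decouples the policy from the input observation.
Introduce policy decoupling, which maps matched observation-entities to action-groups without adding task-specific parameters.
Enable transfer across multiple tasks and faster adaptation to different MARL tasks while preserving interpretability.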

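Proposed Method

Represent observations as observation-entities and embed them with a transformer-based function to compute each agent's Q-values.
Compute the global Q-function from the individual Q-values through a credit assignment function.
Use self-attention to learn the relationship between a matched observation-entity and the other observations, which is what enables policy decoupling.
Divide actions into action-groups matched to observation-entities, so the policy dimension stays flexible without extra parameters.
Integrate a temporal unit (global or individual) in the Dec-POMDP setting to handle the action-observation history.
Optimize with the same TD error as in DQN, replacing the GRU/LSTM with a transformer-based temporal unit.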
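
The policy decoupling idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy single-head, single-layer attention in pure Python, omits the temporal unit and training, and uses hypothetical names throughout (DecoupledPolicy, q_values, feat_dim, emb_dim, n_basic). The choice of 6 basic actions is also an assumption made for the example. What it shows is that one shared head scores every enemy entity, so the number of output Q-values follows the number of observed entities while the parameter set stays fixed.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


class DecoupledPolicy:
    """Hypothetical toy model: one self-attention layer plus two output heads."""

    def __init__(self, feat_dim, emb_dim, n_basic, seed=0):
        rng = random.Random(seed)
        self.embed = rand_matrix(emb_dim, feat_dim, rng)
        self.Wq = rand_matrix(emb_dim, emb_dim, rng)
        self.Wk = rand_matrix(emb_dim, emb_dim, rng)
        self.Wv = rand_matrix(emb_dim, emb_dim, rng)
        # Action-group 1: basic actions, read from the agent's own entity.
        self.basic_head = rand_matrix(n_basic, emb_dim, rng)
        # Action-group 2: one attack Q-value per enemy entity, shared weights.
        self.attack_head = rand_matrix(1, emb_dim, rng)
        self.scale = math.sqrt(emb_dim)

    def attend(self, entities):
        """Scaled dot-product self-attention over all observation-entities."""
        embs = [matvec(self.embed, e) for e in entities]
        qs = [matvec(self.Wq, h) for h in embs]
        ks = [matvec(self.Wk, h) for h in embs]
        vs = [matvec(self.Wv, h) for h in embs]
        out = []
        for q in qs:
            scores = [sum(a * b for a, b in zip(q, k)) / self.scale for k in ks]
            weights = softmax(scores)
            out.append([sum(w * v[d] for w, v in zip(weights, vs))
                        for d in range(len(vs[0]))])
        return out

    def q_values(self, own, allies, enemies):
        """Return Q-values: n_basic basic actions, then one attack per enemy."""
        hidden = self.attend([own] + allies + enemies)
        q_basic = matvec(self.basic_head, hidden[0])
        enemy_hidden = hidden[1 + len(allies):]
        q_attack = [matvec(self.attack_head, h)[0] for h in enemy_hidden]
        return q_basic + q_attack


policy = DecoupledPolicy(feat_dim=4, emb_dim=8, n_basic=6)
rng = random.Random(1)


def entity():
    """Random 4-feature observation-entity for the demo."""
    return [rng.uniform(-1, 1) for _ in range(4)]


# Same parameters, two differently sized tasks (3 enemies vs 5 enemies).
q_small = policy.q_values(entity(), [entity()], [entity() for _ in range(3)])
q_large = policy.q_values(entity(), [entity()], [entity() for _ in range(5)])
print(len(q_small), len(q_large))  # 9 11
```

Because the attack head is applied per entity rather than sized to a fixed action count, moving from a 3-enemy task to a 5-enemy task requires no new parameters, which is the property the method relies on for transfer.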

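Experimental Results

Research Questions

RQ1: Can a single architecture support MARL tasks with differing input/output dimensions without adding new task-specific parameters?
RQ2: Does transformer-based policy decoupling improve representation learning and transfer across multiple MARL tasks?
RQ3: How does UPDeT perform, in terms of performance and transfer speed, when integrated into existing MARL pipelines (VDN, QMIX, QTRAN)?
RQ4: How do different temporal unit designs affect learning under partial observability?
RQ5: Can the attention mechanism provide interpretable insight into policy decisions in multi-agent settings?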

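Key Findings

On challenging SMAC scenarios, UPDeT significantly outperforms RNN-based models when paired with VDN, QMIX, or QTRAN.
The method transfers strongly across tasks and substantially reduces training cost: reported convergence under transfer is at least 10 times faster than GRU-based models and 100 times faster than training from scratch.
Attention-guided policy decoupling yields interpretable policies, as evidenced by attention maps that correlate with strategic phases such as Startup, Attack, and Survival in StarCraft-like settings.
UPDeT plugs into existing MARL methods with almost no architectural changes and delivers significant performance gains in scenarios ranging from easy to hard.
The method scales to large-scale multi-agent system (MAS) settings and shows robust generalization and transfer across task scales.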


This review was generated by AI and checked by a human editor.