QUICK REVIEW

[Paper Review] A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Marc Lanctot, Vinícius Zambaldi|arXiv (Cornell University)|Nov 2, 2017

Reinforcement Learning in Robotics | References: 61 | Cited by: 142

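One-Sentence Summary

This paper proposes Policy-Space Response Oracles (PSRO) and Deep Cognitive Hierarchies (DCH) for MARL, quantifies overfitting between co-trained agents with joint-policy correlation (JPC), produces general policies through meta-strategy-based policy selection, and provides a scalable implementation tested in gridworld games and Leduc poker.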

ABSTRACT

To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker.

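Motivation and Objectives

Quantify how strongly policies learned by independent reinforcement learning overfit to the other agents (joint-policy correlation, JPC).
Develop a general MARL framework (PSRO) that unifies earlier methods and supports deep reinforcement learning policies.
Propose a scalable implementation (DCH) that uses decoupled meta-solvers to make MARL practical in partially observable environments.
Demonstrate the generality and robustness of the approach in gridworld coordination games and Leduc poker.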
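The JPC measurement in the first objective can be sketched as follows. The sketch assumes D independently trained instances of a two-player game and a D x D matrix of mean returns, where entry (i, j) pairs player 1 from instance i with player 2 from instance j. The diagonal holds agents that trained together, the off-diagonal holds agents that never met, and the average proportional loss is (diagonal mean - off-diagonal mean) / diagonal mean. The function name and the numbers in the matrix are illustrative assumptions, not results from the paper.

```python
import numpy as np

def jpc_proportional_loss(returns):
    """Return (diagonal mean, off-diagonal mean, proportional loss) of a D x D return matrix."""
    returns = np.asarray(returns, dtype=float)
    d = returns.shape[0]
    if returns.shape != (d, d) or d < 2:
        raise ValueError("expected a square matrix with at least 2 instances")
    trace = np.trace(returns)
    diag_mean = trace / d                                   # agents that trained together
    off_mean = (returns.sum() - trace) / (d * (d - 1))      # agents that never trained together
    return diag_mean, off_mean, (diag_mean - off_mean) / diag_mean

# Made-up returns for D = 3 training instances (illustration only).
example = [[30.0, 20.0, 10.0],
           [15.0, 30.0, 25.0],
           [20.0, 10.0, 30.0]]
diag_mean, off_mean, loss = jpc_proportional_loss(example)
print(f"diagonal={diag_mean:.2f} off-diagonal={off_mean:.2f} loss={loss:.3f}")
```

A loss near 0 means the agents perform about as well with unfamiliar partners as with their training partners. A loss near 1 means almost all of the return depends on having trained together.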

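Proposed Method

Generalize Double Oracle to Policy-Space Response Oracles (PSRO), where the choices in the meta-game are policies rather than primitive actions.
Use deep reinforcement learning to compute approximate best responses to mixtures of opponent policies.
Use empirical game-theoretic analysis (EGTA) to compute meta-strategies over the policy space.
Introduce Deep Cognitive Hierarchies (DCH): a parallel, fixed-depth, multi-process implementation of PSRO that scales up training.
Use decoupled meta-strategy solvers (such as regret-matching, Hedge, and projected replicator dynamics) and add exploration to encourage diversity.
Under centralized training with decentralized execution, represent policies as neural networks, with an optional central payoff tensor U^Π.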
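The loop described by the items above can be sketched on a toy problem. This is a minimal illustration under several assumptions, not the paper's implementation: the game is rock-paper-scissors given as a payoff matrix, an exact `argmax` best response stands in for the deep RL oracle, the meta-solver is Hedge run in self-play on the empirical game, and exploration is modeled by mixing the meta-strategy with a uniform distribution at rate `gamma`. All function and parameter names are hypothetical.

```python
import numpy as np

# Row player's payoffs in rock-paper-scissors (zero-sum); a toy stand-in game.
RPS = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])

def hedge_meta_solver(U, iters=2000, lr=0.1, gamma=0.0):
    """Approximate meta-strategies for the empirical zero-sum game U via Hedge self-play."""
    n, m = U.shape
    cum_row, cum_col = np.zeros(n), np.zeros(m)
    avg_row, avg_col = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        x = np.exp(lr * (cum_row - cum_row.max()))
        x /= x.sum()
        y = np.exp(lr * (cum_col - cum_col.max()))
        y /= y.sum()
        cum_row += U @ y          # row player's payoff for each of its policies
        cum_col += -(x @ U)       # column player's payoff (zero-sum)
        avg_row += x
        avg_col += y
    sigma_row, sigma_col = avg_row / iters, avg_col / iters
    # Assumed exploration scheme: keep at least gamma / n weight on every policy.
    sigma_row = (1.0 - gamma) * sigma_row + gamma / n
    sigma_col = (1.0 - gamma) * sigma_col + gamma / m
    return sigma_row, sigma_col

def psro(payoff, epochs=5, gamma=0.0):
    """Grow each player's policy set with best responses to the opponent's meta-strategy."""
    pop_row, pop_col = [0], [0]   # both players start with a single policy (rock)
    for _ in range(epochs):
        U = payoff[np.ix_(pop_row, pop_col)]          # empirical payoff table
        s_row, s_col = hedge_meta_solver(U, gamma=gamma)
        mix_row = np.zeros(payoff.shape[0])
        mix_row[pop_row] = s_row
        mix_col = np.zeros(payoff.shape[1])
        mix_col[pop_col] = s_col
        # Oracle step: an exact best response replaces deep RL in this toy game.
        br_row = int(np.argmax(payoff @ mix_col))
        br_col = int(np.argmin(mix_row @ payoff))
        if br_row not in pop_row:
            pop_row.append(br_row)
        if br_col not in pop_col:
            pop_col.append(br_col)
    s_row, s_col = hedge_meta_solver(payoff[np.ix_(pop_row, pop_col)], gamma=gamma)
    return pop_row, s_row, pop_col, s_col

pop_row, s_row, pop_col, s_col = psro(RPS)
print("row policies:", pop_row, "meta-strategy:", np.round(s_row, 3))
```

Swapping the meta-solver changes which earlier algorithm the loop resembles: a solver that puts all weight on the newest policy gives iterated best response, a uniform solver gives fictitious play, and an exact equilibrium solver gives double oracle.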

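Experimental Results

Research Questions

RQ1: How severe is overfitting, as quantified by JPC, when multiple agents learn independently?
RQ2: Can PSRO/DCH produce general, robust policies that still perform well when opponent behavior and partial observability vary?
RQ3: Which meta-strategy solver and level of exploration best balance convergence, exploitability, and generalization?
RQ4: Compared with NFSP and CFR-based methods, how do PSRO/DCH perform across settings in convergence speed and in exploitability against fixed agents?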

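Key Findings

Independent learners suffer a significant joint-policy correlation (JPC) loss when paired with policies from other independently trained instances.
Deep Cognitive Hierarchies (DCH) substantially reduce the JPC loss, by up to 71.7% on the larger maps with more partial observability, and the reduction grows with map size.
PSRO/DCH produce robust counter-strategies in Leduc poker, converge faster than NFSP early in training, and remain competitive in exploitability when evaluated against fixed players.
Through decoupled meta-solvers and online updates, DCH offers a scalable alternative with a smaller memory footprint, making multiagent learning practical in complex environments.
Compared with the baselines, PSRO/DCH balance exploitability and generalization: the learned policies adapt to a range of opponents rather than overfitting to a single equilibrium.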
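Exploitability, which several findings refer to, can be made concrete for a two-player zero-sum matrix game. The sketch below computes NashConv, the total amount both players could gain by switching to a best response, and uses the common convention that exploitability is NashConv divided by the number of players. The function names are hypothetical, and the rock-paper-scissors matrix is only an illustration; the paper's experiments use Leduc poker, which requires a game-tree traversal instead.

```python
import numpy as np

def nash_conv(payoff, x, y):
    """Sum of both players' best-response gains against the profile (x, y).

    payoff holds the row player's payoffs; the column player receives the negation.
    """
    payoff = np.asarray(payoff, dtype=float)
    value = x @ payoff @ y                      # row player's expected payoff
    row_gain = (payoff @ y).max() - value       # row player's best deviation
    col_gain = (-(x @ payoff)).max() + value    # column player's best deviation
    return row_gain + col_gain

def exploitability(payoff, x, y):
    """Average best-response gain per player (NashConv / 2 for two players)."""
    return nash_conv(payoff, x, y) / 2.0

rps = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])
uniform = np.ones(3) / 3.0
always_rock = np.array([1.0, 0.0, 0.0])
print(exploitability(rps, uniform, uniform))       # equilibrium profile
print(exploitability(rps, always_rock, uniform))   # row player is exploitable
```

A profile with zero exploitability is a Nash equilibrium, so lower values indicate policies that are harder for a best-responding opponent to beat.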


This review was generated by AI and reviewed by human editors.