QUICK REVIEW

[Paper Review] "Other-Play" for Zero-Shot Coordination

Hengyuan Hu, Adam Lerer|arXiv (Cornell University)|Mar 6, 2020

Reinforcement Learning in Robotics | References: 52 | Cited by: 33

One-Sentence Summary

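The paper proposes Other-Play (OP), a symmetry-based learning algorithm that improves zero-shot coordination by optimizing for robustness to the way a partner breaks symmetries, and demonstrates it on Hanabi and a lever game.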

ABSTRACT

We consider the problem of zero-shot coordination - constructing AI agents that can coordinate with novel partners they have not seen before (e.g. humans). Standard Multi-Agent Reinforcement Learning (MARL) methods typically focus on the self-play (SP) setting where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP), that enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores when paired with independently trained agents. In preliminary results we also show that our OP agents obtain higher average scores when paired with human players, compared to state-of-the-art SP agents.

Motivation and Objectives

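Motivate zero-shot coordination for settings where the partner is unknown at test time.
Propose OP, which maximizes robustness to symmetry breaking between partners.
Characterize OP theoretically and show that it corresponds to a permutation-invariant meta-equilibrium.
Demonstrate OP with deep reinforcement learning on cooperative tasks and compare it with self-play.
Evaluate OP in Hanabi when paired with AI agents and with humans.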

Proposed Method

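Define the class of symmetries Phi as bijections on states, actions, and observations that leave the Dec-POMDP unchanged.
State the OP objective: maximize the expected return when matched with a partner playing a symmetry-equivalent policy, J_OP = E_{phi ~ Phi}[J(pi^1, phi(pi^2))].
Prove that optimizing the OP objective is equivalent to playing pi_Phi, the uniform mixture of a policy's images under the symmetries in Phi.
Apply OP to deep RL by relabeling the partner policy during training with a phi sampled uniformly from Phi (a form of domain randomization).
Show that OP is compatible with any SP-based optimization method and extends SP toward permutation-invariant equilibria.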
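The relabeling step in the method above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a toy game in which both the observation and the action are a color index, and it assumes Phi is the set of all color permutations. The names `relabel_policy`, `sample_symmetry`, `echo`, and `always_zero` are hypothetical.

```python
import random

def relabel_policy(policy, perm):
    """Return phi(policy) for a color relabeling perm, where perm[c] is the new label of c.

    The wrapped policy undoes the relabeling on its observation, queries the
    original policy, and relabels the chosen action.
    """
    inverse = [0] * len(perm)
    for old, new in enumerate(perm):
        inverse[new] = old

    def relabeled(observation):
        return perm[policy(inverse[observation])]

    return relabeled

def sample_symmetry(num_colors, rng):
    """Draw phi uniformly from all color permutations (the assumed Phi)."""
    perm = list(range(num_colors))
    rng.shuffle(perm)
    return perm

# Two hypothetical partner conventions for the toy game.
def echo(observation):
    """Play the color you see."""
    return observation

def always_zero(observation):
    """Always play color 0, whatever is observed."""
    return 0

rng = random.Random(0)
phi = sample_symmetry(3, rng)
# During an OP training episode the learner would face this relabeled partner.
partner = relabel_policy(always_zero, phi)
```

Under any relabeling, `echo` stays the same policy, while `always_zero` turns into "always play `perm[0]`". A learner trained against freshly relabeled partners therefore cannot rely on conventions of the second kind, which is the intuition behind the objective.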

Experimental Results

Research Questions

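RQ1: In cooperative multi-agent settings, how can we achieve robust coordination with previously unseen partners?
RQ2: Can symmetry considerations improve zero-shot coordination beyond standard self-play?
RQ3: What are the theoretical properties and equilibrium guarantees of Other-Play?
RQ4: On complex partially observable tasks such as Hanabi, how does OP perform with AI partners and with humans?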

Key Findings

Method	Cross-Play	Cross-Play (*)	Self-Play
SAD	2.52 ± 0.34	3.02 ± 0.39	23.97 ± 0.04
SAD + OP	15.32 ± 0.65	18.28 ± 0.36	23.93 ± 0.02
SAD + AUX	17.65 ± 0.69	21.09 ± 0.18	24.09 ± 0.03
SAD + AUX + OP	22.07 ± 0.11	22.49 ± 0.18	24.06 ± 0.02
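One way to read the self-play and cross-play columns is as the diagonal and off-diagonal means of a pairwise score matrix over independently trained agents. The sketch below assumes exactly that reading; the review does not spell out the evaluation protocol or what the (*) column changes, and the numbers in `toy_scores` are made up for illustration, not taken from the paper.

```python
import numpy as np

def self_and_cross_play(scores):
    """Split a pairwise score matrix into self-play and cross-play means.

    scores[i, j] is the mean score when agent i plays with agent j.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if scores.shape != (n, n) or n < 2:
        raise ValueError("need a square matrix with at least two agents")
    diagonal = np.eye(n, dtype=bool)
    # Diagonal entries pair an agent with itself; the rest pair distinct agents.
    return float(scores[diagonal].mean()), float(scores[~diagonal].mean())

# Made-up scores for three hypothetical training seeds.
toy_scores = [
    [24.0, 3.0, 2.0],
    [3.0, 24.0, 4.0],
    [2.0, 4.0, 24.0],
]
self_play, cross_play = self_and_cross_play(toy_scores)
```

A large gap between the two means, as in the SAD row, indicates that each agent has settled on conventions that only its own training partner understands.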

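When paired with independently trained agents, OP achieves better zero-shot coordination than standard SP.
In the lever game, OP converges to the unique 0.9-reward option in both training and testing, unlike SP.
In Hanabi, OP raises cross-play scores, most of all for the simpler models: plain SAD goes from 2.52 to 15.32.
Among the tested configurations, SAD + AUX + OP gives the highest cross-play score (22.07 ± 0.11).
Humans paired with the OP bot score higher on average than humans paired with the SP bot (15.75 vs 9.15).
OP reduces the "inhuman" conventions associated with SP agents and leads to more interpretable coordination.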
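The lever-game finding can be checked with a short calculation. The sketch below assumes the setup as recalled from the paper rather than from this review: 10 levers, nine paying 1.0 and one paying 0.9, with the payoff earned only when both players pull the same lever. It uses the uniform-mixture equivalence to evaluate the OP return, and the helper names are hypothetical.

```python
import numpy as np

# Assumed setup (not stated in this review): nine levers pay 1.0, one pays 0.9,
# and the payoff is earned only when both players pick the same lever.
PAYOFFS = np.array([1.0] * 9 + [0.9])

def joint_return(pi1, pi2, payoffs=PAYOFFS):
    """Expected payoff when the two players sample levers independently."""
    return float(np.sum(payoffs * pi1 * pi2))

def symmetrize(policy, payoffs=PAYOFFS):
    """Average the policy over relabelings of equally paid levers.

    Levers with the same payoff are interchangeable, so the uniform mixture
    over all symmetries spreads each group's probability mass evenly.
    """
    out = np.empty_like(policy)
    for value in np.unique(payoffs):
        group = payoffs == value
        out[group] = policy[group].mean()
    return out

def other_play_return(policy, payoffs=PAYOFFS):
    """OP return of a shared policy, via the uniform-mixture equivalence."""
    mixed = symmetrize(policy, payoffs)
    return joint_return(mixed, mixed, payoffs)

def one_hot(index, n=len(PAYOFFS)):
    vec = np.zeros(n)
    vec[index] = 1.0
    return vec

sp_policy = one_hot(0)  # a self-play convention: "always pull lever 0"
op_policy = one_hot(9)  # the only lever that no relabeling can move

print(joint_return(sp_policy, sp_policy))  # self-play return of the convention
print(other_play_return(sp_policy))        # the same convention under other-play
print(other_play_return(op_policy))        # the 0.9 lever under other-play
```

Under these assumptions the self-play convention earns 1.0 with itself but only 1/9 (about 0.11) once the nine equivalent levers are relabeled, whereas the 0.9 lever keeps its full 0.9. That is why an OP learner prefers the slightly worse but unambiguous lever.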

This review was generated by AI and reviewed by a human editor.