QUICK REVIEW

[Paper Review] "Other-Play" for Zero-Shot Coordination

Hengyuan Hu, Adam Lerer|arXiv (Cornell University)|Mar 6, 2020

Reinforcement Learning in Robotics | References: 52 | Citations: 33

One-Sentence Summary

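This paper introduces Other-Play (OP), a symmetry-based meta-learning approach that improves zero-shot coordination by optimizing for robustness to symmetry breaking in the partner's policy, and demonstrates it on Hanabi and the lever game.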

ABSTRACT

We consider the problem of zero-shot coordination - constructing AI agents that can coordinate with novel partners they have not seen before (e.g. humans). Standard Multi-Agent Reinforcement Learning (MARL) methods typically focus on the self-play (SP) setting where agents construct strategies by playing the game with themselves repeatedly. Unfortunately, applying SP naively to the zero-shot coordination problem can produce agents that establish highly specialized conventions that do not carry over to novel partners they have not been trained with. We introduce a novel learning algorithm called other-play (OP), that enhances self-play by looking for more robust strategies, exploiting the presence of known symmetries in the underlying problem. We characterize OP theoretically as well as experimentally. We study the cooperative card game Hanabi and show that OP agents achieve higher scores when paired with independently trained agents. In preliminary results we also show that our OP agents obtain higher average scores when paired with human players, compared to state-of-the-art SP agents.

Motivation and Objectives

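Motivate zero-shot coordination, the setting in which the partner remains unknown until test time.
Propose OP, which maximizes robustness to symmetry breaking between partners.
Characterize OP theoretically and show that it is a permutation-invariant meta-equilibrium.
Demonstrate OP with deep reinforcement learning on cooperative tasks and compare it with self-play.
Evaluate OP in Hanabi with both AI agents and human partners.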

Proposed Method

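Define the symmetries Φ as bijections on states, actions, and observations that leave the Dec-POMDP invariant.
Formulate the OP objective: maximize the expected return when matched with a symmetry-equivalent version of the partner's policy. J_OP = E_{φ ~ Φ}[J(π^1, φ(π^2))].
Prove that the OP policy corresponds to π_Φ, the uniform mixture of the symmetry-transformed policies.
Implement OP in deep reinforcement learning by transforming the partner policy with a φ sampled uniformly from Φ during training (a form of domain randomization).
Show that OP is compatible with any SP-based optimization and extends SP to permutation-invariant equilibria.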
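
The OP objective above can be evaluated exactly on a toy version of the lever game. The following is a minimal sketch, not the authors' implementation: the four-lever payoff vector is a scaled-down assumption chosen so that Φ can be enumerated, and the names `expected_return`, `symmetries`, `apply_symmetry`, and `op_return` are hypothetical helpers introduced only for illustration.

```python
import itertools
import numpy as np

# Hypothetical scaled-down lever game (an assumption for illustration):
# both players pull one lever and are paid only if they pull the same one.
# Three interchangeable levers pay 1.0; one distinct lever pays 0.9.
PAYOFFS = np.array([1.0, 1.0, 1.0, 0.9])

def expected_return(p1, p2):
    """J(pi^1, pi^2) for independent lever choices; reward only on a match."""
    return float(np.sum(p1 * p2 * PAYOFFS))

def symmetries():
    """Label permutations that leave the payoffs unchanged (the set Phi)."""
    n = len(PAYOFFS)
    return [perm for perm in itertools.permutations(range(n))
            if np.allclose(PAYOFFS[list(perm)], PAYOFFS)]

def apply_symmetry(perm, policy):
    """phi(pi): the policy obtained by relabeling lever i as lever perm[i]."""
    relabeled = np.zeros_like(policy)
    relabeled[list(perm)] = policy
    return relabeled

def op_return(p1, p2):
    """Average of J(pi^1, phi(pi^2)) over phi drawn uniformly from Phi."""
    phis = symmetries()
    total = sum(expected_return(p1, apply_symmetry(phi, p2)) for phi in phis)
    return total / len(phis)

def one_hot(i):
    policy = np.zeros(len(PAYOFFS))
    policy[i] = 1.0
    return policy

sp_convention = one_hot(0)  # an arbitrary 1.0 lever, as self-play might pick
op_convention = one_hot(3)  # the unique 0.9 lever

sp_self_play = expected_return(sp_convention, sp_convention)
sp_other_play = op_return(sp_convention, sp_convention)
op_other_play = op_return(op_convention, op_convention)
print(sp_self_play, sp_other_play, op_other_play)
```

Under these assumptions, committing to one of the interchangeable 1.0 levers is perfect in self-play but loses most of its value once the partner's labels are permuted, while the unique 0.9 lever is unaffected by every symmetry in Φ. Enumerating Φ is only feasible for tiny games; sampling φ, as in the deep RL implementation described above, is the practical alternative.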

Experimental Results

Research Questions

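RQ1: How can robust coordination with previously unknown partners be achieved in cooperative multi-agent environments?
RQ2: Can symmetry considerations be exploited to improve zero-shot coordination beyond standard self-play?
RQ3: What are the theoretical properties and equilibrium guarantees of Other-Play?
RQ4: How does OP perform on complex partially observable tasks such as Hanabi, with both AI and human partners?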

Key Findings

Method	Cross-Play	Cross-Play (*)	Self-Play
SAD	2.52 ± 0.34	3.02 ± 0.39	23.97 ± 0.04
SAD + OP	15.32 ± 0.65	18.28 ± 0.36	23.93 ± 0.02
SAD + AUX	17.65 ± 0.69	21.09 ± 0.18	24.09 ± 0.03
SAD + AUX + OP	22.07 ± 0.11	22.49 ± 0.18	24.06 ± 0.02

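OP improves zero-shot coordination over standard SP when agents are paired with independently trained partners.
In the lever game, OP converges to the unique 0.9-reward option at both training and test time, whereas SP does not.
In Hanabi, OP improves cross-play scores, with the largest effect on the simpler model (the SAD variant).
SAD + AUX + OP achieves the highest cross-play performance among the configurations tested.
Humans paired with the OP bot obtained a higher average score than those paired with the SP bot (15.75 vs. 9.15).
OP suppresses the "inhuman" conventions observed in SP agents and yields more interpretable coordination.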
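
The cross-play and self-play columns can be read as averages over a pairwise score matrix of independently trained agents: off-diagonal entries pair different agents, diagonal entries pair an agent with itself. The sketch below illustrates that bookkeeping on the lever game. It is an assumption-laden illustration rather than the paper's evaluation code: the ten-lever payoff vector, the lever choices assigned to each training run, and the helper names `lever_scores` and `cross_play_and_self_play` are all hypothetical.

```python
import numpy as np

def lever_scores(choices, payoffs):
    """Pairwise scores for agents that each commit to a single lever.

    scores[i, j] is the reward when agent i plays with agent j:
    the lever's payoff if both chose the same lever, otherwise 0.
    """
    choices = np.asarray(choices)
    match = choices[:, None] == choices[None, :]
    return np.where(match, payoffs[choices][:, None], 0.0)

def cross_play_and_self_play(scores):
    """Mean off-diagonal score (cross-play) and mean diagonal score (self-play)."""
    scores = np.asarray(scores, dtype=float)
    off_diagonal = ~np.eye(scores.shape[0], dtype=bool)
    return float(scores[off_diagonal].mean()), float(np.diag(scores).mean())

# Assumed lever game: nine interchangeable levers pay 1.0, lever 9 pays 0.9.
payoffs = np.array([1.0] * 9 + [0.9])

# Hypothetical outcomes of four independent training runs per method.
sp_choices = [0, 4, 4, 7]  # self-play runs settle on arbitrary 1.0 levers
op_choices = [9, 9, 9, 9]  # other-play runs settle on the unique 0.9 lever

sp_cross, sp_self = cross_play_and_self_play(lever_scores(sp_choices, payoffs))
op_cross, op_self = cross_play_and_self_play(lever_scores(op_choices, payoffs))
print(f"SP: cross-play={sp_cross:.3f}, self-play={sp_self:.3f}")
print(f"OP: cross-play={op_cross:.3f}, self-play={op_self:.3f}")
```

With these made-up choices, the SP agents score perfectly with themselves but rarely match each other, which mirrors the qualitative gap between the Cross-Play and Self-Play columns in the table.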

This review was generated by AI and checked by a human editor.