QUICK REVIEW

[Paper Review] Multi-Agent Reinforcement Learning is a Sequence Modeling Problem

Muning Wen, Jakub Grudzien Kuba|arXiv (Cornell University)|May 30, 2022

Reinforcement Learning in Robotics | Cited by 79

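One-Line Summary

This paper proposes the Multi-Agent Transformer (MAT), an encoder-decoder architecture that treats cooperative MARL as a sequence modeling task and is trained on-policy, with monotonic improvement and linear-time updates. MAT shows state-of-the-art results and strong generalization across multiple MARL benchmarks.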

ABSTRACT

Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and recently reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the prosperous development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that effectively casts cooperative multi-agent reinforcement learning (MARL) into SM problems wherein the task is to map agents' observation sequence to agents' optimal action sequence. Our goal is to build the bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to our MAT is an encoder-decoder architecture which leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior art such as Decision Transformer, which fits only pre-collected offline data, MAT is trained by online trial and error from the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on StarCraftII, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
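
For reference, the multi-agent advantage decomposition theorem mentioned in the abstract is usually stated along the following lines. The abstract itself gives no formula, so the notation below is an assumption: i_{1:n} denotes an ordering of the n agents, o the joint observation, and a^{i_m} the action of the m-th agent in that ordering.

```latex
% Multi-agent advantage decomposition (notation assumed, not taken from this page).
% The joint advantage equals the sum of per-agent advantages, each conditioned
% on the actions already chosen by the preceding agents in the ordering.
A_{\pi}^{i_{1:n}}\left(o, a^{i_{1:n}}\right)
  = \sum_{m=1}^{n} A_{\pi}^{i_m}\left(o, a^{i_{1:m-1}}, a^{i_m}\right)
```

Read this way, if each agent in turn picks an action with positive advantage given its predecessors' actions, the joint action also has positive advantage, which is what allows the joint policy search to proceed one agent at a time.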

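Motivation and Objectives

Connect cooperative MARL with sequence modeling so that modern sequence models can be applied to it.
Turn the joint policy search into a sequential decision process, keeping complexity linear in the number of agents.
Provide an online, on-policy training paradigm that guarantees monotonic performance improvement.
Demonstrate MAT's advantage and its generalization across diverse MARL benchmarks.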
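
To make the complexity objective concrete, here is a minimal sketch comparing the size of the joint action space with the number of per-agent choices made when agents decide one at a time. The function names and the choice of 5 actions per agent are hypothetical, and this is only an illustration of the search-space intuition, not the paper's complexity analysis.

```python
def joint_action_count(n_agents: int, n_actions: int) -> int:
    """Number of joint actions when all agents are searched over at once."""
    return n_actions ** n_agents


def sequential_action_count(n_agents: int, n_actions: int) -> int:
    """Number of per-agent choices when agents pick actions one after another."""
    return n_agents * n_actions


# Assumed setting: every agent has the same 5 discrete actions.
for n in (2, 5, 10):
    print(n, joint_action_count(n, 5), sequential_action_count(n, 5))
```

The first count grows exponentially with the number of agents, while the second grows linearly.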

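Proposed Method

Introduce MAT as an encoder-decoder architecture.
Use the encoder to map the sequence of agent observations to latent representations.
Use the decoder with masked attention to generate agents' actions one at a time, each conditioned on the actions of its predecessors.
Train with a PPO-style clipped objective and GAE-like advantage estimates for joint optimization.
Provide a CTDE variant (MAT-Dec) as a decentralized baseline for comparison.
Use the multi-agent advantage decomposition theorem to show the monotonic improvement guarantee.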
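
The training signal named in the list above (a PPO-style clipped objective with GAE-like advantages) can be sketched generically as follows. This is the standard PPO and GAE recipe written in NumPy, not the authors' implementation; the function names, the default values of `gamma`, `lam`, and `eps`, and the array shapes are assumptions made for illustration.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: shape (T,). values: shape (T + 1,), last entry is the bootstrap value.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    adv = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """PPO clipped surrogate (to be maximized), averaged over all entries."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    adv = np.asarray(adv, dtype=float)
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

In a multi-agent setting one could pass log-probabilities of shape (T, n_agents) and advantages of shape (T, 1) so that a shared advantage broadcasts across agents; that layout is an assumption here, not something the review specifies.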

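Experimental Results

Research Questions

RQ1: Can cooperative MARL problems be effectively modeled as sequence modeling problems using an encoder-decoder architecture?
RQ2: Can the Transformer-based MAT achieve better performance and data efficiency than strong baselines on standard MARL benchmarks?
RQ3: Can MAT generalize to unseen tasks and to changes in the number or type of agents (few-shot and zero-shot settings)?

Key Findings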

Task	Difficulty	MAT	MAT-Dec	MAPPO	HAPPO	QMIX	UPDeT	Steps
3m	Easy	100.0 (1.8)	100.0 (1.1)	100.0 (0.4)	100.0 (1.2)	96.9 (1.3)	100.0 (5.2)	5e5
8m	Easy	100.0 (1.1)	97.5 (2.5)	96.8 (2.9)	97.5 (1.1)	97.7 (1.9)	96.3 (9.7)	1e6
1c3s5z	Easy	100.0 (2.4)	100.0 (0.4)	100.0 (2.2)	97.5 (1.8)	96.9 (1.5)	/	2e6
MMM	Easy	100.0 (2.2)	98.1 (2.1)	95.6 (4.5)	81.2 (22.9)	91.2 (3.2)	/	2e6
2c vs 64zg	Hard	100.0 (1.3)	95.9 (2.3)	100.0 (2.7)	90.0 (4.8)	90.3 (4.0)	/	5e6
3s vs 5z	Hard	100.0 (1.7)	100.0 (1.3)	100.0 (2.5)	91.9 (5.3)	92.3 (4.4)	/	5e6
3s5z	Hard	100.0 (1.9)	100.0 (3.3)	72.5 (26.5)	90.0 (3.5)	84.3 (5.4)	/	3e6
5m vs 6m	Hard	90.6 (4.4)	83.1 (4.6)	88.2 (6.2)	73.8 (4.4)	75.8 (3.7)	90.6 (6.1)	1e7
8m vs 9m	Hard	100.0 (3.1)	95.0 (4.6)	93.8 (3.5)	86.2 (4.4)	92.6 (4.0)	/	5e6
10m vs 11m	Hard	100.0 (1.4)	100.0 (2.0)	96.3 (5.8)	77.5 (9.7)	95.8 (6.1)	/	5e6
25m	Hard	100.0 (1.3)	86.9 (5.6)	100.0 (2.7)	0.6 (0.8)	90.2 (9.8)	2.8 (3.1)	2e6
27m vs 30m	Hard+	100.0 (0.7)	95.3 (2.2)	93.1 (3.2)	0.0 (0.0)	39.2 (8.8)	/	1e7
MMM2	Hard+	93.8 (2.6)	91.2 (5.3)	81.8 (10.1)	0.3 (0.4)	88.3 (2.4)	/	1e7
6h vs 8z	Hard+	98.8 (1.3)	93.8 (4.7)	88.4 (5.7)	0.0 (0.0)	9.7 (3.1)	/	1e7
3s5z vs 3s6z	Hard+	96.5 (1.3)	85.3 (7.5)	84.3 (19.4)	82.8 (21.2)	68.8 (21.2)	/	2e7
3s6z	Hard+	?	?	?	?	?	/	2e7

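MAT achieves better performance and data efficiency than MAPPO, HAPPO, QMIX, and UPDeT across multiple MARL benchmarks.
MAT retains the monotonic improvement guarantee while using a sequential update scheme.
MAT shows strong few-shot and zero-shot generalization to tasks that differ in the number of agents and in failure modes.
The decoder allows a fully parallelized training cycle, which speeds up learning compared with strictly sequential methods.
The CTDE variant (MAT-Dec) confirms the importance of the MAT decoder for the performance gains.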
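
One way to see why masked attention permits parallel training is the minimal sketch below: with a causal mask, the output for each agent depends only on that agent and its predecessors, so all positions can be computed in one pass when the earlier actions are already known. This is a deliberately simplified single-head attention in NumPy with no learned projections (queries, keys, and values are all the raw input, which is an assumption for brevity); it is not the MAT decoder itself.

```python
import numpy as np


def causal_self_attention(x):
    """Single-head self-attention over agents with a lower-triangular mask.

    x: array of shape (n_agents, d). Row i of the output uses only rows 0..i of x.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Block attention from each agent to the agents that come after it.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 hypothetical agents, 8 features each
out = causal_self_attention(x)

# Perturb only the last agent's input and recompute everything in one pass.
x_changed = x.copy()
x_changed[3] += 1.0
out_changed = causal_self_attention(x_changed)

# Earlier agents' outputs should be unaffected by the change to a later agent.
print(np.allclose(out[:3], out_changed[:3]))
```

At execution time the actions still have to be generated one agent at a time, since each agent's input includes actions that do not exist until its predecessors have acted.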

This review was generated by AI and checked by a human editor.