QUICK REVIEW

[Paper Review] The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

Chao Yu, Akash Velu|arXiv (Cornell University)|Mar 2, 2021

Reinforcement Learning in Robotics|Cited by 587

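One-Line Summary

PPO-based methods match or exceed the state of the art on several cooperative MARL benchmarks with minimal tuning and no domain-specific modifications, challenging the belief that PPO is sample-inefficient in multi-agent settings.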

ABSTRACT

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.

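Motivation and Objectives

Prompt a re-evaluation of PPO in cooperative multi-agent reinforcement learning (MARL) settings.
Evaluate PPO-based methods (MAPPO and IPPO) against strong off-policy baselines on multiple MARL benchmarks.
Identify the key implementation factors and hyperparameters that determine PPO's performance in MARL, and provide practical tuning guidelines.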

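Proposed Method

PPO is adapted to the multi-agent setting and applied as MAPPO (centralized value function inputs) and IPPO (independent agents).
Parameter sharing among homogeneous agents is used to improve learning efficiency.
GAE (Generalized Advantage Estimation) is applied together with advantage normalization and value clipping.
Value function inputs, value normalization, training data usage, clipping, and batch size are examined as the key factors.
The methods are compared against off-policy baselines (QMix, MADDPG, RODE, and others) across four environments.
Source code is released in the marlbenchmark/on-policy repository.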
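
The GAE, advantage normalization, and clipping steps listed above can be sketched roughly as follows. This is a minimal, dependency-free Python illustration, not the code from the authors' repository. The function names (`compute_gae`, `normalize`, `ppo_policy_loss`, `clipped_value_loss`) and the defaults `gamma=0.99`, `lam=0.95`, and `clip_eps=0.2` are assumptions chosen for illustration, and the value-clipping form shown is one common formulation rather than a confirmed detail of the paper.

```python
import math

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t], values[t], dones[t] describe step t; dones[t] is 1 if the
    episode ended after step t. last_value bootstraps the final state.
    """
    advantages = [0.0] * len(rewards)
    next_value, gae = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # One-step TD error, cut off at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

def normalize(xs, eps=1e-8):
    """Advantage normalization: zero mean, unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / (std + eps) for x in xs]

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def ppo_policy_loss(new_logp, old_logp, adv, clip_eps=0.2):
    """Clipped surrogate objective, negated so that lower is better."""
    total = 0.0
    for nl, ol, a in zip(new_logp, old_logp, adv):
        ratio = math.exp(nl - ol)
        total += min(ratio * a, clip(ratio, 1 - clip_eps, 1 + clip_eps) * a)
    return -total / len(adv)

def clipped_value_loss(new_v, old_v, returns, clip_eps=0.2):
    """Value loss where the new prediction may move at most clip_eps
    away from the old prediction (assumed formulation)."""
    total = 0.0
    for nv, ov, r in zip(new_v, old_v, returns):
        v_clip = ov + clip(nv - ov, -clip_eps, clip_eps)
        total += max((nv - r) ** 2, (v_clip - r) ** 2)
    return 0.5 * total / len(returns)
```

With parameter sharing, the same functions would be applied to the pooled trajectories of all homogeneous agents, so each agent's data contributes to a single set of policy and value parameters.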

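Experimental Results

Research Questions

RQ1: Can PPO-based methods match or exceed off-policy MARL baselines across diverse cooperative benchmarks?
RQ2: Which implementation choices and hyperparameters most strongly affect PPO's performance?
RQ3: Do centralized value function inputs (MAPPO) offer an advantage over independent PPO (IPPO) in multi-agent cooperation?
RQ4: What practical guidelines can be derived for tuning PPO effectively for MARL?
RQ5: Are PPO-based methods robust across environments that differ in agent homogeneity and observation structure?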

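Key Findings

MAPPO and IPPO achieve competitive final performance on MPE, SMAC, GRF, and Hanabi, with sample efficiency comparable to off-policy baselines.
With centralized value inputs, MAPPO often matches or exceeds RODE and other off-policy methods on several SMAC maps.
Under the same training budget, MAPPO outperforms QMix on the Google Research Football scenarios.
Five practical factors (value normalization, value function inputs, training data usage, policy/value clipping, and batch size) strongly affect PPO's performance in MARL, and each comes with a clear best-practice guideline.
Value normalization stabilizes value learning and improves final performance on several benchmarks.
Centralized value inputs that combine local observations with the global state (AS/FP) usually outperform both purely concatenated local observations and the purely environment-provided global state.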
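
To make the last two findings more concrete, the sketch below shows one way value-target normalization and a combined value input could look. It is an assumption-laden illustration in plain Python: `RunningValueNormalizer` is a hypothetical running mean/variance tracker standing in for whatever normalization scheme the paper actually uses, and `build_value_input` is a hypothetical helper that concatenates the global state with one agent's local observation, in the spirit of the agent-specific input described above.

```python
import math

class RunningValueNormalizer:
    """Tracks a running mean and variance of value targets (Welford update).

    Hypothetical stand-in for value normalization; the review does not
    specify the exact scheme.
    """

    def __init__(self, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, targets):
        for x in targets:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to unit variance before any data has been seen.
        var = self.m2 / self.count if self.count > 0 else 1.0
        return math.sqrt(var + self.eps)

    def normalize(self, targets):
        """Map raw value targets into the normalized space the critic regresses to."""
        return [(x - self.mean) / self.std for x in targets]

    def denormalize(self, outputs):
        """Map critic outputs back to the original return scale (e.g. for GAE)."""
        return [y * self.std + self.mean for y in outputs]


def build_value_input(global_state, local_obs):
    """Combine the global state with one agent's local observation
    into a single flat feature vector for the centralized critic."""
    return list(global_state) + list(local_obs)
```

In this sketch the critic would be trained on `normalize(returns)` after calling `update(returns)`, and its predictions would pass through `denormalize` before being used to compute advantages.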

This review was generated by AI and checked by a human editor.