QUICK REVIEW

[Paper Review] The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

Chao Yu, Akash Velu|arXiv (Cornell University)|Mar 2, 2021

Reinforcement Learning in Robotics|587 citations

TL;DR

PPO-based methods, with minimal tuning and no domain-specific changes, achieve competitive or state-of-the-art results across multiple cooperative MARL benchmarks, challenging the belief that PPO is less sample-efficient than off-policy methods in multi-agent settings.

ABSTRACT

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.

Motivation & Objective

Re-evaluate PPO in cooperative multi-agent reinforcement learning (MARL), where it is often assumed to be less sample efficient than off-policy methods.
Evaluate PPO-based methods (MAPPO and IPPO) against strong off-policy baselines on multiple MARL benchmarks.
Identify key implementation and hyperparameter factors that drive PPO performance in MARL and provide practical tuning guidance.

Proposed method

Adapt PPO into multi-agent settings as MAPPO (centralized value function inputs) and IPPO (independent agents).
Use parameter sharing for homogeneous agents to improve learning efficiency.
Apply Generalized Advantage Estimation (GAE) with advantage normalization and value clipping.
Investigate value function inputs, value normalization, training data usage, clipping, and batch size as critical factors.
Benchmark against off-policy baselines (QMix, MADDPG, RODE, etc.) across four environments.
Release source code in the marlbenchmark/on-policy repository on GitHub.
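
The GAE, advantage normalization, and value clipping steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (gae_advantages, normalize, clipped_value_loss), the default values of gamma, lam, and clip, and the assumption of a single trajectory segment with no episode termination are all chosen here for illustration only.

```python
from statistics import fmean, pstdev


def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one segment without terminations.

    rewards[t] and values[t] are per-step quantities; last_value is the value
    estimate for the state after the final step (used for bootstrapping).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages


def normalize(xs, eps=1e-8):
    """Shift to zero mean and scale to (approximately) unit standard deviation."""
    mean, std = fmean(xs), pstdev(xs)
    return [(x - mean) / (std + eps) for x in xs]


def clipped_value_loss(new_value, old_value, target, clip=0.2):
    """PPO-style value clipping: the larger of the clipped and unclipped squared errors."""
    clipped = old_value + max(-clip, min(clip, new_value - old_value))
    return max((new_value - target) ** 2, (clipped - target) ** 2)


# Example usage on a toy three-step segment (made-up numbers).
rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.4, 0.6]
adv = gae_advantages(rewards, values, last_value=0.0)
value_targets = [a + v for a, v in zip(adv, values)]  # regression targets for the critic
norm_adv = normalize(adv)                             # what the policy loss would consume
```

In this sketch the advantages are normalized per batch before the policy update, while the unnormalized advantages plus the old value estimates give the critic's regression targets.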

Experimental results

Research questions

RQ1: Can PPO-based methods achieve competitive or superior performance to off-policy MARL baselines across diverse cooperative benchmarks?
RQ2: What implementation choices and hyperparameters most strongly influence PPO performance in MARL?
RQ3: Do centralized value-function inputs (MAPPO) offer advantages over independent PPO (IPPO) in multi-agent cooperation?
RQ4: What practical guidelines can be derived to effectively tune PPO for MARL?
RQ5: Are PPO-based methods robust to different environments with varying agent homogeneity and observation structures?

Key findings

MAPPO and IPPO achieve competitive or superior final performance and similar sample efficiency to off-policy baselines on MPE, SMAC, GRF, and Hanabi.
MAPPO with centralized value inputs often matches or surpasses RODE and other off-policy methods across several SMAC maps.
MAPPO outperforms QMix in Google Football scenarios under the same training budget.
Five practical factors (value normalization, value function inputs, training data usage, policy/value clipping, and batch size) strongly influence PPO performance in MARL and have clear best-practice guidance.
Value normalization stabilizes value learning and improves final performance in several benchmarks.
Centralized value inputs that combine local observations with global state (the agent-specific, AS, and feature-pruned, FP, variants) generally outperform inputs built only from concatenated local observations or only from the environment-provided global state.
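
The value normalization finding above can be made concrete with a small sketch. It assumes a simple running mean and standard deviation over value targets, which is a stand-in and not necessarily the scheme used in the paper's code. The class name RunningValueNormalizer and its methods are hypothetical.

```python
import math


class RunningValueNormalizer:
    """Tracks the running mean and variance of value targets (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, targets):
        """Fold a batch of value targets into the running statistics."""
        for x in targets:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # With fewer than two samples there is no spread to estimate.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count) + self.eps

    def normalize(self, x):
        """Map a raw value target into the normalized space the critic regresses on."""
        return (x - self.mean) / self.std

    def denormalize(self, y):
        """Map a critic output back to the raw return scale (e.g. before computing GAE)."""
        return y * self.std + self.mean


# Example usage with made-up value targets.
normalizer = RunningValueNormalizer()
normalizer.update([2.0, 4.0, 6.0])
critic_target = normalizer.normalize(6.0)
raw_value = normalizer.denormalize(critic_target)
```

Under this assumption the critic is trained on normalized targets, and its outputs are mapped back to the raw scale whenever advantages are computed, so large or drifting returns do not destabilize value learning.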

This review was created by AI and reviewed by human editors.