[Paper Review] Dealing with Non-Stationarity in Multi-Agent Deep Reinforcement Learning
The paper surveys how non-stationarity arises in multi-agent deep RL, categorizes methods that mitigate it (centralized critics, decentralized learning, opponent modeling, meta-learning, and communication), and closes with open problems and future directions.
Recent developments in deep reinforcement learning are concerned with creating decision-making agents which can perform well in various complex domains. A particular approach which has received increasing attention is multi-agent reinforcement learning, in which multiple agents learn concurrently to coordinate their actions. In such multi-agent environments, additional learning problems arise due to the continually changing decision-making policies of agents. This paper surveys recent works that address the non-stationarity problem in multi-agent deep reinforcement learning. The surveyed methods range from modifications in the training procedure, such as centralized training, to learning representations of the opponent's policy, meta-learning, communication, and decentralized learning. The survey concludes with a list of open problems and possible lines of future research.
Research Motivation and Objectives
- Motivate and define non-stationarity in multi-agent DRL and its impact on learning stability.
- Survey and categorize recent approaches addressing non-stationarity across training architectures and information assumptions.
- Identify promising directions and open problems for future research in multi-agent non-stationarity.
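A minimal illustration of the problem follows. It assumes matching pennies as the game, and the example is not taken from the paper. Agent 1's expected reward for each action depends on agent 2's current policy, so as agent 2 keeps learning, the reward signal agent 1 observes drifts and its best response flips. The names `PAYOFF_1`, `expected_payoffs`, and `best_response` are hypothetical.

```python
# Matching pennies from agent 1's point of view: +1 if the two actions match, -1 otherwise.
PAYOFF_1 = [[1.0, -1.0],
            [-1.0, 1.0]]  # indexed as PAYOFF_1[a1][a2]


def expected_payoffs(opponent_policy):
    """Expected reward of each of agent 1's actions under agent 2's mixed policy."""
    return [sum(p * PAYOFF_1[a1][a2] for a2, p in enumerate(opponent_policy))
            for a1 in range(len(PAYOFF_1))]


def best_response(opponent_policy):
    """Agent 1's greedy action against the opponent's current policy."""
    values = expected_payoffs(opponent_policy)
    return max(range(len(values)), key=lambda a: values[a])


# Agent 2 keeps updating its policy, so agent 1's reward landscape moves under it.
history = []
for p_heads in (0.9, 0.6, 0.4, 0.1):
    pi_2 = [p_heads, 1.0 - p_heads]
    history.append((p_heads, expected_payoffs(pi_2), best_response(pi_2)))
    print(p_heads, expected_payoffs(pi_2), best_response(pi_2))
```

Agent 1's own learning rule is identical at every step. Only the opponent's policy moved. This is why single-agent convergence arguments, which assume a fixed transition and reward function, do not carry over directly to concurrent learners.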
Proposed Approach
- Review and categorize existing methods for non-stationarity in multi-agent DRL.
- Provide a taxonomy covering training/execution architecture, opponent modeling, opponent information, and algorithm.
- Summarize representative algorithms and their empirical settings in a consolidated table.
Results
Research Questions
- RQ1: What approaches have been proposed to address non-stationarity in multi-agent deep reinforcement learning?
- RQ2: How do centralized vs. decentralized training, opponent modeling, meta-learning, learned representations, and communication contribute to stabilizing learning under non-stationarity?
- RQ3: What are the open problems and future research directions in this area?
Key Findings
| Paper | Setting | Training | Execution | Opponent modeling | Opponent information | Algorithm | Num. agents |
|---|---|---|---|---|---|---|---|
| Tacchetti et al. (2019) | Mixed | Centr. | Decentr. | Explicit | Obs / actions | A2C | ≥ 2 |
| Singh et al. (2019) | Mixed | Decentr. | Decentr. | No | None | PG | ≥ 2 |
| Letcher et al. (2019) | Mixed | Decentr. | Decentr. | Explicit | Parameters | PG | 2 |
| Li et al. (2019) | Mixed | Centr. | Decentr. | No | Obs / actions | DDPG | ≥ 2 |
| Al-Shedivat et al. (2018) | Comp. | Decentr. | Decentr. | No | None | PPO | 2 |
| Bansal et al. (2018) | Comp. | Decentr. | Decentr. | No | None | PPO | 2 |
| Raileanu et al. (2018) | Mixed | Centr. | Centr. | Explicit | Obs / actions | A3C | 2 |
| Mordatch and Abbeel (2018) | Coop. | Decentr. | Decentr. | No | None | PG | ≥ 2 |
| Foerster et al. (2018a) | Mixed | Decentr. | Decentr. | Explicit | Parameters | PG | 2 |
| Grover et al. (2018) | Mixed | Centr. | Centr. | Explicit | Obs / actions | PPO / DDPG | 2 |
| Rabinowitz et al. (2018) | Mixed | Centr. | Centr. | Explicit | Obs / actions | Imitation | ≥ 2 |
| Foerster et al. (2018b) | Coop. | Centr. | Decentr. | No | Obs / actions | Actor-critic | ≥ 2 |
| Lowe et al. (2017) | Mixed | Centr. | Decentr. | No | Obs / actions | DDPG | ≥ 2 |
| Foerster et al. (2017) | Mixed | Decentr. | Decentr. | No | None | Q-learning | ≥ 2 |
| Sukhbaatar et al. (2016) | Coop. | Decentr. | Decentr. | No | None | PG | ≥ 2 |
| Foerster et al. (2016a) | Coop. | Centr. | Decentr. | No | None | Q-learning | ≥ 2 |
| He et al. (2016) | Mixed | Centr. | Centr. | Implicit | Obs | Q-learning | 2 |
| Zhang and Lesser (2010) | Mixed | Decentr. | Decentr. | Explicit | Parameters | PG | 2 |
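The Training and Execution columns distinguish where each agent's inputs come from. A rough sketch of the centralized-training, decentralized-execution pattern used by methods such as Lowe et al. (2017) is shown below. It shows only the information structure: each actor reads its own observation, while each critic, used during training only, reads every agent's observation and action. Linear maps stand in for neural networks, no gradient updates are performed, and all names and dimensions are hypothetical.

```python
import random

N_AGENTS, OBS_DIM, ACT_DIM = 3, 4, 2  # hypothetical sizes


def linear(weights, x):
    """Tiny linear layer standing in for a network: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
# Decentralized actors: each maps only its OWN observation to an action vector.
actors = [random_weights(ACT_DIM, OBS_DIM, rng) for _ in range(N_AGENTS)]
# Centralized critics (training only): each sees ALL observations and ALL actions.
CRITIC_IN = N_AGENTS * (OBS_DIM + ACT_DIM)
critics = [random_weights(1, CRITIC_IN, rng) for _ in range(N_AGENTS)]


def act(i, own_obs):
    """Execution-time policy of agent i: local observation in, action out."""
    return linear(actors[i], own_obs)


def centralized_q(i, all_obs, all_actions):
    """Critic of agent i: scores the joint observation-action vector."""
    joint = [v for o in all_obs for v in o] + [v for a in all_actions for v in a]
    return linear(critics[i], joint)[0]


obs = [[rng.random() for _ in range(OBS_DIM)] for _ in range(N_AGENTS)]
actions = [act(i, obs[i]) for i in range(N_AGENTS)]                   # local info only
q_values = [centralized_q(i, obs, actions) for i in range(N_AGENTS)]  # training only
```

Because each critic conditions on all agents' actions, the transition dynamics it sees do not change when other agents update their policies. At execution time the critics are discarded and only `act` is needed.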
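- Centralized critics with decentralized actors stabilize training: during training each critic conditions on the joint observations and actions of all agents, so the dynamics it sees stay stationary as other agents' policies change, while execution uses only each agent's local observations.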
- Opponent modeling and learning representations can mitigate non-stationarity and improve generalization to diverse opponents.
- Meta-learning approaches (e.g., MAML-inspired methods) enable rapid adaptation to changing opponents and dynamics.
- Self-play and stabilized experience replay are effective decentralized strategies under non-stationarity.
- Learned communication between agents is a useful mechanism for coordinating policies and stabilizing learning in multi-agent settings.
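Stabilized experience replay (Foerster et al., 2017, listed in the table above) targets a specific symptom: transitions in the replay buffer were generated while the other agents followed older policies. One of its proposals reweights each stored transition by a ratio. The numerator is the other agents' joint action probability under their current policies. The denominator is the probability recorded at collection time. The sketch below applies that importance weight to a squared TD loss. It assumes independent discrete policies given as functions from state to action probabilities, and the `Transition` fields and function names are hypothetical.

```python
from dataclasses import dataclass
from math import prod
from typing import Callable, List, Sequence, Tuple

# A policy maps a state to a list of action probabilities (assumed discrete).
Policy = Callable[[str], List[float]]


@dataclass
class Transition:
    state: str
    other_actions: Tuple[int, ...]  # actions taken by the other agents
    behaviour_prob: float           # their joint action prob. when the sample was stored
    td_error: float                 # this agent's TD error for the sample


def joint_prob(policies: Sequence[Policy], state: str, actions: Sequence[int]) -> float:
    """Probability of the other agents' joint action, assuming independent policies."""
    return prod(pi(state)[a] for pi, a in zip(policies, actions))


def importance_weight(t: Transition, current_policies: Sequence[Policy]) -> float:
    """Current over collection-time probability of the others' joint action."""
    return joint_prob(current_policies, t.state, t.other_actions) / t.behaviour_prob


def weighted_td_loss(batch: Sequence[Transition], current_policies: Sequence[Policy]) -> float:
    """Mean importance-weighted squared TD error over a replay batch."""
    return sum(importance_weight(t, current_policies) * t.td_error ** 2
               for t in batch) / len(batch)


# Usage: the two other agents were uniform when these samples were stored ...
old_policies = [lambda s: [0.5, 0.5], lambda s: [0.5, 0.5]]
# ... but the first of them now strongly prefers action 0.
new_policies = [lambda s: [0.9, 0.1], lambda s: [0.5, 0.5]]

batch = [
    Transition("s0", (0, 1), joint_prob(old_policies, "s0", (0, 1)), td_error=1.0),
    Transition("s0", (1, 1), joint_prob(old_policies, "s0", (1, 1)), td_error=1.0),
]
weights = [importance_weight(t, new_policies) for t in batch]
loss = weighted_td_loss(batch, new_policies)
```

In the usage example, the sample whose joint action has become more likely is up-weighted (about 1.8). The one that has become unlikely is down-weighted (about 0.2). Importance ratios like these can have high variance once policies drift far from those that filled the buffer. The same paper also proposes an alternative: conditioning each agent's value function on a "fingerprint" of the training process, such as the iteration number and exploration rate.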
This review was generated by AI and checked by a human editor.