QUICK REVIEW

[Paper Review] Dynamics of Multi-Agent Actor-Critic Learning in Stochastic Games: from Multistability and Chaos to Stable Cooperation

Yuxin Geng, Wolfram Barfuß|arXiv (Cornell University)|Jan 12, 2026

Reinforcement Learning in Robotics | Citations: 0

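ONE-LINE SUMMARY

This paper analyzes entropy-regularized multi-agent actor-critic learning in stochastic games. It shows chaos in Matching Pennies and multistability in the Prisoner's Dilemma, finds that entropy promotes stable cooperation, and links MARL to evolutionary game theory.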

ABSTRACT

Achieving robust coordination and cooperation is a central challenge in multi-agent reinforcement learning (MARL). Uncovering the mechanisms underlying such emergent behaviors calls for a dynamical understanding of learning processes. In this work, we investigate the dynamics of actor-critic agents in stochastic games, focusing on the impact of entropy regularization. By leveraging time-scale separation, we derive the system's evolution equations, which are then formally analyzed using dynamical systems theory. We find that in the constant-sum game of Matching Pennies, the system exhibits chaotic behavior. Entropy regularization mitigates this chaos and drives the dynamics toward convergence to fair cooperation. In contrast, in the general-sum game of the Prisoner's Dilemma, the system displays multistability. Interestingly, the three stable equilibria of the system correspond to the well-known ALLC (Always Cooperate), ALLD (Always Defect), and GRIM (Grim Trigger) strategies from evolutionary game theory (EGT). Entropy regularization strengthens system resilience by enlarging the basin of attraction of the cooperative equilibrium. Our findings reveal a close link between the mechanism of direct reciprocity in EGT and how cooperation emerges in MARL, offering insights for designing more robust and collaborative multi-agent systems.

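MOTIVATION AND OBJECTIVES

Study learning dynamics under entropy regularization in stochastic games, with the goal of achieving robust coordination and cooperation in MARL.
Derive and analyze the continuous-time dynamics (ODEs) of entropy-regularized A2C to understand its equilibria, stability, and bifurcations.
Illustrate the findings on two classic two-state games: Matching Pennies and the Prisoner's Dilemma.
Explore how the mechanisms of cooperation in MARL relate to concepts from EGT.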

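PROPOSED METHOD

Formulate entropy-regularized A2C in stochastic games, with Boltzmann action selection and an entropy term in the objective.
Separate interaction, critic updates, and actor updates across two time scales to obtain deterministic ODEs.
Express the policy update in terms of Q-values, V-values, and the advantage function, yielding a dynamical system in policy space.
Analyze equilibria on the policy-consistency manifold and relate interior equilibria to the quantal response equilibrium (QRE).
Apply dynamical systems methods to Matching Pennies (MP) and the Prisoner's Dilemma (PD) to characterize chaos, multistability, and the stabilizing effect of entropy.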
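To make the ingredients above concrete, here is a minimal sketch of a Boltzmann (softmax) policy and an entropy-regularized advantage update, integrated with explicit Euler steps. It is a deliberately simplified toy and not the paper's derivation: it assumes a single-state (stateless) Matching Pennies game, exact expected payoffs in place of a learned critic, and hypothetical names and parameter values (`alpha`, `dt`, `steps`, the starting policies). The paper's two-state stochastic game and its discount factor are not modeled here.

```python
import math


def boltzmann_policy(logits):
    """Softmax over logits, shifted by the max for numerical stability."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def soft_advantage(policy, q_values, alpha):
    """Advantage of the entropy-augmented values q - alpha * log(pi)."""
    soft_q = [q - alpha * math.log(p) for q, p in zip(q_values, policy)]
    baseline = sum(p * s for p, s in zip(policy, soft_q))
    return [s - baseline for s in soft_q]


def simulate_matching_pennies(alpha, steps=20000, dt=0.01):
    """Euler-integrate softmax policy-gradient dynamics for both players.

    Assumption: stateless Matching Pennies with exact expected payoffs.
    Returns each player's final probability of playing "heads".
    """
    payoff = [[1.0, -1.0], [-1.0, 1.0]]   # row player's payoff; column player gets the negative
    theta1 = [math.log(9.0), 0.0]          # row player starts at P(heads) = 0.9
    theta2 = [math.log(0.25), 0.0]         # column player starts at P(heads) = 0.2
    for _ in range(steps):
        pi1 = boltzmann_policy(theta1)
        pi2 = boltzmann_policy(theta2)
        # Expected payoff of each action against the opponent's current policy.
        q1 = [sum(payoff[a][b] * pi2[b] for b in range(2)) for a in range(2)]
        q2 = [-sum(payoff[a][b] * pi1[a] for a in range(2)) for b in range(2)]
        adv1 = soft_advantage(pi1, q1, alpha)
        adv2 = soft_advantage(pi2, q2, alpha)
        # Gradient of the entropy-regularized objective w.r.t. softmax logits.
        theta1 = [t + dt * p * a for t, p, a in zip(theta1, pi1, adv1)]
        theta2 = [t + dt * p * a for t, p, a in zip(theta2, pi2, adv2)]
    return boltzmann_policy(theta1)[0], boltzmann_policy(theta2)[0]


if __name__ == "__main__":
    print(simulate_matching_pennies(alpha=0.5))
```

With `alpha > 0` the toy settles near the mixed point where both players choose heads with probability 0.5. Setting `alpha = 0` removes the damping term, so in the continuous-time version of this toy the trajectory keeps orbiting the mixed point rather than converging. A single-state toy like this cannot reproduce the chaos reported for the two-state game.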

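EXPERIMENTAL RESULTS

Research Questions

RQ1: How do entropy regularization and two-timescale learning affect the stability and convergence of multi-agent actor-critic dynamics in stochastic games?
RQ2: Which equilibria emerge under entropy-regularized A2C in the representative two-state games (MP and PD), and how do they relate to EGT strategies?
RQ3: Does entropy regularization suppress chaos and promote cooperation, and how does it affect the basins of attraction?
RQ4: How do MARL dynamics connect to the direct reciprocity mechanism of evolutionary game theory?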

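Key Findings

In Matching Pennies without entropy, learning trajectories can become chaotic as the discount factor grows, whereas entropy regularization suppresses the chaos and drives convergence to fair cooperation.
The Prisoner's Dilemma exhibits three stable equilibria corresponding to ALLC, ALLD, and GRIM; entropy regularization enlarges the basin of attraction of cooperation.
The interior cooperative equilibrium in MP is globally attracting under entropy regularization, which suppresses persistent oscillations and yields a fair outcome.
The PD analysis yields a direct reciprocity condition that matches classical EGT results, and entropy acts as a mutation-like mechanism that strengthens cooperation.
A formal link between direct reciprocity in EGT and the emergence of cooperation in MARL is established through the derived ODE dynamics and their relation to QRE.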
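For readers unfamiliar with the classical direct reciprocity condition, the sketch below computes the textbook discount-factor threshold at which cooperating against a GRIM opponent beats a one-shot defection in the repeated Prisoner's Dilemma. This is the standard repeated-game result, not the condition derived in the paper, whose exact form is not given in this review. The function names and the payoff values (T=5, R=3, P=1, S=0) are illustrative assumptions.

```python
def grim_cooperation_threshold(T, R, P, S):
    """Smallest discount factor at which cooperating with GRIM is a best reply.

    Textbook condition: R / (1 - d) >= T + d * P / (1 - d),
    which rearranges to d >= (T - R) / (T - P).
    """
    if not (T > R > P > S):
        raise ValueError("Prisoner's Dilemma requires T > R > P > S")
    return (T - R) / (T - P)


def value_always_cooperate(R, delta):
    """Discounted return from mutual cooperation forever."""
    return R / (1.0 - delta)


def value_defect_once(T, P, delta):
    """Discounted return from defecting now and being punished forever after."""
    return T + delta * P / (1.0 - delta)


if __name__ == "__main__":
    T, R, P, S = 5.0, 3.0, 1.0, 0.0   # illustrative payoffs, not taken from the paper
    threshold = grim_cooperation_threshold(T, R, P, S)
    for delta in (0.4, 0.6):
        coop = value_always_cooperate(R, delta)
        defect = value_defect_once(T, P, delta)
        print(delta, threshold, coop, defect, coop >= defect)
```

Above the threshold the future loss from triggering permanent defection outweighs the one-time temptation payoff, which is the intuition behind direct reciprocity.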

This review was generated by AI and checked by a human editor.