QUICK REVIEW

[Paper Review] Dynamics of Multi-Agent Actor-Critic Learning in Stochastic Games: from Multistability and Chaos to Stable Cooperation

Yuxin Geng, Wolfram Barfuß|arXiv (Cornell University)|Jan 12, 2026

Reinforcement Learning in Robotics | Cited by 0

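One-Sentence Summary

This paper analyzes entropy-regularized multi-agent actor-critic learning in stochastic games. It shows chaos in Matching Pennies and multistability in the Prisoner's Dilemma, finds that entropy regularization promotes stable cooperation, and links MARL to evolutionary game theory.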

ABSTRACT

Achieving robust coordination and cooperation is a central challenge in multi-agent reinforcement learning (MARL). Uncovering the mechanisms underlying such emergent behaviors calls for a dynamical understanding of learning processes. In this work, we investigate the dynamics of actor-critic agents in stochastic games, focusing on the impact of entropy regularization. By leveraging time-scale separation, we derive the system's evolution equations, which are then formally analyzed using dynamical systems theory. We find that in the constant-sum game of Matching Pennies, the system exhibits chaotic behavior. Entropy regularization mitigates this chaos and drives the dynamics toward convergence to fair cooperation. In contrast, in the general-sum game of the Prisoner's Dilemma, the system displays multistability. Interestingly, the three stable equilibria of the system correspond to the well-known ALLC (Always Cooperate), ALLD (Always Defect), and GRIM (Grim Trigger) strategies from evolutionary game theory (EGT). Entropy regularization strengthens system resilience by enlarging the basin of attraction of the cooperative equilibrium. Our findings reveal a close link between the mechanism of direct reciprocity in EGT and how cooperation emerges in MARL, offering insights for designing more robust and collaborative multi-agent systems.

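Motivation and Objectives

Understand how robust cooperation and coordination emerge in MARL by studying entropy-regularized learning dynamics in stochastic games.
Derive and analyze the continuous-time dynamics (ordinary differential equations, ODEs) of entropy-regularized A2C to understand equilibria, stability, and bifurcations.
Illustrate the analysis with two paradigmatic two-state games: Matching Pennies and the Prisoner's Dilemma.
Explore the connection between cooperation mechanisms in MARL and concepts from evolutionary game theory (EGT).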

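Proposed Method

Formulate entropy-regularized A2C in stochastic games, with Boltzmann action selection and an entropy term added to the objective.
Derive two-time-scale dynamics that separate the interaction, critic-update, and actor-update processes, yielding deterministic ODEs.
Express the policy update in terms of Q-values, V-values, and the advantage function, as a dynamical system on policy space.
Analyze equilibria on the policy simplex, and relate interior equilibria to quantal response equilibria (QRE).
Apply dynamical systems tools to Matching Pennies (MP) and the Prisoner's Dilemma (PD) to characterize chaos, multistability, and the stabilizing effect of entropy.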
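The stabilizing role of the entropy term can be illustrated with a much simpler setting than the paper's. The sketch below is not the paper's derivation: it assumes a single-state (stateless) Matching Pennies game, replaces the learned critic with exact expected advantages, and integrates the resulting softmax policy-gradient ODE with a plain Euler step. The function name `simulate` and the parameters `tau`, `x0`, `y0`, `dt`, and `steps` are hypothetical names chosen for illustration. This simplified system cycles rather than showing chaos when entropy is switched off, so it only illustrates the damping effect, not the chaotic regime reported for the two-state game.

```python
import math

def sigmoid(z):
    """Two-action Boltzmann (softmax) policy: probability of the first action."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate(tau, x0=0.8, y0=0.3, dt=0.01, steps=20000):
    """Euler-integrate entropy-regularized policy-gradient dynamics.

    Assumptions (not taken from the paper): stateless Matching Pennies,
    payoff +1/-1 for player 1 on match/mismatch, player 2 gets the negative,
    exact expected advantages instead of a learned critic.
    x, y are the probabilities that players 1 and 2 play Heads;
    u, v are the corresponding logit differences.
    """
    u = math.log(x0 / (1.0 - x0))
    v = math.log(y0 / (1.0 - y0))
    for _ in range(steps):
        x, y = sigmoid(u), sigmoid(v)
        adv1 = 2.0 * (2.0 * y - 1.0)    # q1(Heads) - q1(Tails)
        adv2 = -2.0 * (2.0 * x - 1.0)   # q2(Heads) - q2(Tails)
        # Gradient of expected reward plus tau * entropy w.r.t. the logits;
        # the entropy term contributes -tau * u (resp. -tau * v).
        du = 2.0 * x * (1.0 - x) * (adv1 - tau * u)
        dv = 2.0 * y * (1.0 - y) * (adv2 - tau * v)
        u += dt * du
        v += dt * dv
    return sigmoid(u), sigmoid(v)

if __name__ == "__main__":
    print("tau = 0.5:", simulate(tau=0.5))
    print("tau = 0.0:", simulate(tau=0.0))
```

In this toy system, a positive `tau` pulls both policies toward the uniform mix (0.5, 0.5), while `tau = 0` leaves them orbiting around it. The rest point satisfies `tau * u = adv1` and `tau * v = adv2`, that is, each policy is a logistic response to its own advantage, which is the logit form of a quantal response equilibrium.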

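Experimental Results

Research Questions

RQ1: How do entropy regularization and two-time-scale learning affect the stability and convergence of multi-agent actor-critic dynamics in stochastic games?
RQ2: In representative two-state games (MP and PD), which equilibria does entropy-regularized A2C produce, and how do they relate to EGT strategies?
RQ3: Does entropy regularization suppress chaos and promote cooperation, and how does it affect the basins of attraction?
RQ4: How do MARL dynamics relate to the direct reciprocity mechanism of evolutionary game theory?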

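Main Findings

In Matching Pennies without entropy, learning trajectories can become chaotic as the discount factor increases; entropy regularization suppresses the chaos and drives convergence to fair cooperation.
In the Prisoner's Dilemma, the system has three stable equilibria, corresponding to ALLC, ALLD, and GRIM; entropy regularization enlarges the basin of attraction of the cooperative equilibrium.
With entropy, the interior cooperative equilibrium of MP becomes globally attracting, which prevents long-term oscillations and yields fair outcomes.
The PD analysis yields a direct-reciprocity condition resembling classical EGT results; entropy acts as a mutation-like mechanism that strengthens cooperation.
The link between direct reciprocity in EGT and the emergence of cooperation in MARL is established formally, through the derived ODE dynamics and the connection to QRE.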
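For readers unfamiliar with the three strategies named above, the following sketch plays them against each other in a textbook iterated Prisoner's Dilemma and compares discounted returns. It is an illustration only: the payoff values (T=5, R=3, P=1, S=0), the history-based strategy definitions, and the helper `discounted_return` are assumptions, and this summary does not specify how the paper encodes these strategies in its two-state game. The classical direct-reciprocity condition says that mutual GRIM resists a permanent defector when the discount factor is at least (T - R) / (T - P), which is 0.5 for these payoffs.

```python
C, D = 0, 1  # cooperate, defect

# Assumed textbook payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    (C, C): (3, 3),  # reward R for mutual cooperation
    (C, D): (0, 5),  # sucker S versus temptation T
    (D, C): (5, 0),
    (D, D): (1, 1),  # punishment P for mutual defection
}

def allc(own_history, opp_history):
    """ALLC: always cooperate."""
    return C

def alld(own_history, opp_history):
    """ALLD: always defect."""
    return D

def grim(own_history, opp_history):
    """GRIM: cooperate until the opponent defects once, then defect forever."""
    return D if D in opp_history else C

def discounted_return(strategy1, strategy2, gamma, rounds=500):
    """Discounted payoffs of two deterministic strategies over a finite horizon."""
    h1, h2 = [], []
    total1 = total2 = 0.0
    discount = 1.0
    for _ in range(rounds):
        a1 = strategy1(h1, h2)
        a2 = strategy2(h2, h1)
        r1, r2 = PAYOFF[(a1, a2)]
        total1 += discount * r1
        total2 += discount * r2
        discount *= gamma
        h1.append(a1)
        h2.append(a2)
    return total1, total2

if __name__ == "__main__":
    for gamma in (0.9, 0.3):
        cooperate = discounted_return(grim, grim, gamma)[0]
        deviate = discounted_return(alld, grim, gamma)[0]
        print(f"gamma={gamma}: GRIM vs GRIM = {cooperate:.3f}, ALLD vs GRIM = {deviate:.3f}")
```

With a discount factor of 0.9, staying with GRIM earns more than defecting against it, while at 0.3 defection pays more, which matches the 0.5 threshold for the assumed payoffs.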

This review was generated by AI and reviewed by a human editor.