QUICK REVIEW

[Paper Review] Actor-Dual-Critic Dynamics for Zero-sum and Identical-Interest Stochastic Games

Ahmed Said Donmez, Yuksel Arslantas|arXiv (Cornell University)|Jan 31, 2026

Adaptive Dynamic Programming Control | Cited by 0

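One-sentence summary

Proposes a payoff-based, decentralized, three-timescale actor-dual-critic learning framework for stochastic games that converges to approximate equilibria in two-agent zero-sum and multi-agent identical-interest settings.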

ABSTRACT

We propose a novel independent and payoff-based learning framework for stochastic games that is model-free, game-agnostic, and gradient-free. The learning dynamics follow a best-response-type actor-critic architecture, where agents update their strategies (actors) using feedback from two distinct critics: a fast critic that intuitively responds to observed payoffs under limited information, and a slow critic that deliberatively approximates the solution to the underlying dynamic programming problem. Crucially, the learning process relies on non-equilibrium adaptation through smoothed best responses to observed payoffs. We establish convergence to (approximate) equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games over an infinite horizon. This provides one of the first payoff-based and fully decentralized learning algorithms with theoretical guarantees in both settings. Empirical results further validate the robustness and effectiveness of the proposed approach across both classes of games.

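Motivation and objectives

Develop a model-free, game-agnostic learning framework for stochastic games that requires minimal information.
Introduce an independent, payoff-based, three-timescale actor-dual-critic architecture.
Establish convergence guarantees to approximate Nash equilibria in two-agent zero-sum and multi-agent identical-interest games.
Provide a stability analysis under exploration and non-equilibrium adaptation.
Validate robustness through empirical results on both classes of games.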

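Proposed method

Introduces a fast critic that uses observed rewards to quickly estimate the local q-function, and a slow critic that approximates the dynamic programming value through fixed-point updates.
Uses an actor that performs epsilon-best-response updates against the fast critic, which makes the policy updates gradient-free.
Models exploration with an epsilon-greedy mechanism and folds it into an effective stochastic game whose rewards and transitions are adjusted accordingly.
Uses three-timescale stochastic approximation with decaying step sizes to separate fast q-learning, policy updates, and slow value estimation.
Proves convergence to approximate equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games under independent, payoff-based dynamics.
Provides algorithmic details and a theoretical link to quasi-monotonicity techniques to support the convergence analysis.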
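The bullets above can be sketched in code for the two-agent zero-sum case. This is only an illustrative sketch under stated assumptions, not the paper's algorithm: the softmax form of the smoothed best response, the step-size exponents (0.55, 0.75, 1), the per-state visit counter, and the names `smoothed_best_response` and `run_dynamics` are all hypothetical choices made here. Each agent sees only the state, its own action, and its own payoff.

```python
import numpy as np


def smoothed_best_response(q_row, temperature):
    """Softmax over local q-values: one assumed form of a smoothed best response."""
    z = (q_row - q_row.max()) / temperature
    w = np.exp(z)
    return w / w.sum()


def run_dynamics(rewards, transitions, discount=0.8, temperature=0.1,
                 explore=0.05, steps=20000, seed=0):
    """Two-agent zero-sum sketch.

    rewards[s, a1, a2] is agent 1's payoff (agent 2 receives the negative).
    transitions[s, a1, a2] is a probability vector over next states.
    Returns per-agent lists (q, pi, v).
    """
    rng = np.random.default_rng(seed)
    n_states, n_a1, n_a2 = rewards.shape
    n_actions = (n_a1, n_a2)
    q = [np.zeros((n_states, n)) for n in n_actions]           # fast critics
    pi = [np.full((n_states, n), 1.0 / n) for n in n_actions]  # actors
    v = [np.zeros(n_states) for _ in n_actions]                # slow critics
    visits = np.zeros(n_states)
    s = 0
    for _ in range(steps):
        visits[s] += 1
        k = float(visits[s])
        # Assumed decaying step sizes: fast critic >> actor >> slow critic.
        step_fast, step_actor, step_slow = k ** -0.55, k ** -0.75, 1.0 / k
        acts = []
        for i in range(2):
            if rng.random() < explore:
                # Epsilon-greedy exploration: uniform random action.
                acts.append(int(rng.integers(n_actions[i])))
            else:
                acts.append(int(rng.choice(n_actions[i], p=pi[i][s])))
        r1 = rewards[s, acts[0], acts[1]]
        s_next = int(rng.choice(n_states, p=transitions[s, acts[0], acts[1]]))
        for i, payoff in enumerate((r1, -r1)):
            a = acts[i]
            # Fast critic: track the observed payoff plus the slow critic's value.
            target = payoff + discount * v[i][s_next]
            q[i][s, a] += step_fast * (target - q[i][s, a])
            # Actor: gradient-free move toward the smoothed best response.
            br = smoothed_best_response(q[i][s], temperature)
            pi[i][s] += step_actor * (br - pi[i][s])
            # Slow critic: move toward the value implied by the current policy.
            v[i][s] += step_slow * (pi[i][s] @ q[i][s] - v[i][s])
        s = s_next
    return q, pi, v
```

Calling `run_dynamics` with a reward array of shape `(n_states, n_a1, n_a2)` and a transition array of shape `(n_states, n_a1, n_a2, n_states)` returns the learned critics and policies. Because every update is a convex combination, the policies stay on the simplex and the critics stay within the maximum absolute reward divided by `1 - discount`.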

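Experimental results

Research questions

RQ1: Can independent agents that rely only on payoff feedback and non-equilibrium adaptation converge to equilibrium in key classes of stochastic games?
RQ2: Can payoff-based, gradient-free actor-dual-critic dynamics guarantee convergence to approximate Nash equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games?
RQ3: How do the fast and slow critics and the epsilon-best-response actor cope with the non-stationarity caused by changing policies?
RQ4: How does exploration affect the convergence and equilibrium approximation of this framework?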

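Key findings

The proposed actor-dual-critic dynamics converge to an approximate Nash equilibrium in two-agent zero-sum stochastic games.
In multi-agent identical-interest stochastic games, the same framework converges to an approximate equilibrium even though the value is not unique and the contraction property does not hold.
Under fully independent, payoff-based updates, exploration keeps the equilibrium bias within a controllable, epsilon-dependent bound.
The three-timescale updates (fast critic, actor, slow critic) keep each subproblem quasi-stationary, which simplifies the convergence analysis.
The effective stochastic game formulation shows that exploration can be treated as part of the rewards and the transition kernel, so equilibrium reasoning is preserved.
Empirical results confirm robustness and effectiveness in both classes of games.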
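One way to read the effective-game idea is the following sketch, which assumes uniform epsilon-greedy exploration and may differ from the paper's exact construction. The notation is introduced here for illustration: $a^i$ is the action agent $i$ intends, $b^i$ is the action it actually executes, $A^i$ is its action set, and $r^i$, $p$ are the original rewards and transition kernel.

```latex
% Assumed construction: exploration absorbed into rewards and transitions.
\begin{align}
e^i(b^i \mid a^i) &= (1-\epsilon)\,\mathbf{1}\{b^i = a^i\} + \frac{\epsilon}{|A^i|}, \\
\tilde{r}^i(s, a) &= \sum_{b} \Big( \prod_{j} e^j(b^j \mid a^j) \Big)\, r^i(s, b), \\
\tilde{p}(s' \mid s, a) &= \sum_{b} \Big( \prod_{j} e^j(b^j \mid a^j) \Big)\, p(s' \mid s, b).
\end{align}
```

Under this reading, playing intended actions in the game $(\tilde{r}, \tilde{p})$ is equivalent to playing with exploration in the original game, so equilibrium arguments can be carried out on the effective game and the gap to the original game is governed by $\epsilon$.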

[Figure, panel (b): Identical-interest stochastic games.]


This review was generated by AI and checked by a human editor.