QUICK REVIEW

[Paper Review] Shaping the learning signal in a combined Q-learning rule to improve structured cooperation

Chunpeng Du, Zongyang Li | arXiv (Cornell University) | Jan 29, 2026

Evolutionary Game Theory and Cooperation | Cited by 0

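One-Sentence Summary

The paper shows that adding reputation to the Q-learning reinforcement signal on a lattice promotes cooperation, with the size of the effect depending on the learning rate and the discount factor.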

ABSTRACT

Q-learning provides a standard reinforcement learning framework for studying cooperation by specifying how agents update action values from repeated local interactions outcomes. Although previous work has shown that reputation can promote cooperation in such systems, most models introduce reputation by modifying payoffs, encoding it directly in the state or changing partner selection, which makes it difficult to isolate the role of the learning signal itself. Here, we construct the reinforcement signal as a weighted combination of reputation and game payoffs, leaving the game and network structure unchanged. We find that increasing the weight on reputation generally promotes cooperation by consolidating clusters, but this effect is conditional on the learning dynamics. Specifically, this promoting effect vanishes in two regimes: when the learning rate is extremely small, which prevents effective information propagation and when the discount factor approaches one, as distant future expectations obscure the immediate reputational advantage. Outside these limiting cases, the efficacy of reputation in promoting cooperation is attenuated by higher learning rates but amplified by larger discount factors. These results advance the understanding of cooperative dynamics by demonstrating that cooperation can be stabilized through the reputational shaping of learning signals alone, providing critical insights into the interplay between social information and individual learning parameters.

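Research Motivation and Objectives

Investigate how a reputation-informed reinforcement signal affects cooperation dynamics on a spatial lattice.
Isolate the effect of shaping the learning signal from the effects of changing payoffs or network structure when promoting cooperation.
Analyze how the learning rate and the discount factor modulate the effect of reputation on network reciprocity.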

Proposed Method

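Use a square lattice with a von Neumann neighborhood and the weak Prisoner's Dilemma as the game.
Represent each agent with a Q-table updated by standard Q-learning, using a reinforcement signal that is a weighted mix of normalized payoff and normalized reputation: Π_i(t) = (1-β)·π_i(t) + β·R_i(t).
Reputation evolves deterministically with the action taken: r_i(t+1) = r_i(t) + 1 after cooperation and r_i(t+1) = r_i(t) - 1 after defection, clipped to [0,100].
Normalize payoff and reputation to [0,1] before combining them; update Q-values via Q(s,a) ← (1-α)Q(s,a) + α[Π_i(t) + γ max_{a'} Q(s',a')].
Use ε-greedy exploration with synchronous updating; run 100,000 Monte Carlo steps and measure the cooperation level ρ_C over the final 5,000 steps.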
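The method steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the review does not specify the state definition, the normalization constants, the boundary conditions, or the initial conditions, so the sketch assumes that an agent's state is its own previous action, that payoffs follow the common weak Prisoner's Dilemma parameterization (R=1, T=b, S=P=0) and are divided by the maximum 4b, that reputation is divided by 100, and that the lattice is periodic. All function names and default parameter values are hypothetical.

```python
import numpy as np

# Illustrative sketch only; names and defaults are assumptions, not the paper's code.
C, D = 0, 1  # action indices: cooperate, defect

def neighbor_sum(x):
    """Sum over the four von Neumann neighbors (assumed periodic boundaries)."""
    return (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
            np.roll(x, 1, 1) + np.roll(x, -1, 1))

def payoffs(actions, b):
    """Weak Prisoner's Dilemma payoffs, assuming R=1, T=b, S=P=0."""
    n_coop = neighbor_sum((actions == C).astype(float))
    return np.where(actions == C, n_coop, b * n_coop)

def step(Q, state, rep, rng, alpha, gamma, beta, b, eps):
    """One synchronous update of every agent; Q has shape (L, L, 2, 2)."""
    L = state.shape[0]
    ii, jj = np.indices((L, L))
    # epsilon-greedy choice from each agent's Q-table row for its current state
    greedy = Q[ii, jj, state].argmax(axis=-1)
    explore = rng.random((L, L)) < eps
    actions = np.where(explore, rng.integers(0, 2, (L, L)), greedy)
    # deterministic reputation update: +1 for C, -1 for D, clipped to [0, 100]
    rep = np.clip(rep + np.where(actions == C, 1, -1), 0, 100)
    # normalize both components to [0, 1] (assumed constants: 4b and 100)
    pi_norm = payoffs(actions, b) / (4 * b)
    rep_norm = rep / 100.0
    signal = (1 - beta) * pi_norm + beta * rep_norm
    # Q-learning update; assumed: the next state is the action just taken
    next_state = actions
    target = signal + gamma * Q[ii, jj, next_state].max(axis=-1)
    Q[ii, jj, state, actions] = ((1 - alpha) * Q[ii, jj, state, actions]
                                 + alpha * target)
    return Q, next_state, rep, actions

def run(L=20, steps=200, tail=50, alpha=0.1, gamma=0.5,
        beta=0.5, b=1.1, eps=0.02, seed=0):
    """Return the cooperation level averaged over the last `tail` steps."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((L, L, 2, 2))
    state = rng.integers(0, 2, (L, L))
    rep = rng.integers(0, 101, (L, L))  # assumed random initial reputation
    rho = []
    for _ in range(steps):
        Q, state, rep, actions = step(Q, state, rep, rng,
                                      alpha, gamma, beta, b, eps)
        rho.append((actions == C).mean())
    return float(np.mean(rho[-tail:]))
```

The defaults in `run` are kept small so the sketch finishes quickly; the protocol described above would correspond to `steps=100_000` and `tail=5_000`. Setting `beta=0.0` recovers a purely payoff-driven signal and `beta=1.0` a purely reputation-driven one, which is the comparison the review describes.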

Figure 2: The cooperation level on the parameter plane of $\alpha$ and reputation weight (a) and on the parameter plane of discount factor $\gamma$ and reputation weight (b). The color-coded stationary values of $\rho_{C}$ are indicated by the bar shown on the right-hand side. While the effect of pa

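Experimental Results

Research Questions

RQ1: With the game and the network held fixed, does introducing reputation into the reinforcement signal promote cooperation on the lattice?
RQ2: How do the learning rate α and the discount factor γ interact with the reputation weight β to affect cooperation?
RQ3: What do the spatiotemporal patterns and the microscopic transition dynamics look like under reputation-weighted Q-learning?
RQ4: Are there parameter regimes in which reputation fails to promote cooperation?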

Key Findings

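At a given dilemma strength b, cooperation increases monotonically with the reputation weight β.
In two limiting regimes, a very small α or a γ approaching one, the promoting effect of reputation vanishes.
Outside these two limits, a larger α weakens the effect of reputation, whereas a larger γ amplifies it and thereby strengthens network reciprocity.
Raising β accelerates the consolidation and expansion of cooperative clusters, enhancing network reciprocity.
Spatiotemporal patterns show that a larger β produces larger cooperator domains, although the clusters are less compact than under imitation-based updating.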
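The two limiting regimes can be made plausible with a little algebra on the Q-learning update quoted in the method section. What follows is a heuristic reading, not the paper's derivation; it assumes only that the normalized signal satisfies 0 ≤ Π_i(t) ≤ 1, that 0 ≤ α ≤ 1 and 0 ≤ γ < 1, and that Q-values start inside the bound shown.

$$
Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta,
\qquad
\delta = \Pi_i(t) + \gamma \max_{a'} Q(s',a') - Q(s,a)
$$

$$
0 \le \Pi_i(t) \le 1
\;\Longrightarrow\;
0 \le Q(s,a) \le \sum_{k \ge 0} \gamma^{k} = \frac{1}{1-\gamma}
$$

In the first form, each step moves Q(s,a) by α times the temporal-difference error δ, so as α approaches zero the Q-values stay close to their initial values and the choice of β has little room to shape behavior. In the second, the attainable scale of Q grows like 1/(1-γ), while the reputational term contributes at most β ≤ 1 to any single step's signal; as γ approaches one, that immediate contribution becomes small relative to the accumulated future value, consistent with the abstract's remark that distant future expectations obscure the immediate reputational advantage.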

Figure 3: The time evolution of spatial patterns at different reputation weight $\beta$ . From top to bottom, the values of $\beta$ are 0.0, 0.5 and 1.0. The snapshots were taken at time steps $T=0,1000,10000,50000$ and $99999$ . Defectors and cooperators are represented by blue and red cells, respe


This review was generated by AI and reviewed by human editors.