QUICK REVIEW

[Paper Review] Inequity aversion improves cooperation in intertemporal social dilemmas

Edward Hughes, Joel Z. Leibo|arXiv (Cornell University)|Mar 23, 2018

Experimental Behavioral Economics Studies | Cited by 76

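One-Sentence Summary

The authors extend inequity-averse preferences to multi-agent reinforcement learning in Markov games. They show that advantageous inequity aversion promotes cooperation in intertemporal social dilemmas, while disadvantageous inequity aversion works through punishment in certain settings.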

ABSTRACT

Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.

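Research Motivation and Goals

Advance the study of cooperation in temporally extended social dilemmas, beyond static matrix games.
Generalize inequity-averse preferences to sequential Markov games in the multi-agent reinforcement learning setting.
Investigate how inequity aversion shapes learning and policy formation to promote cooperation.
Examine how inequity aversion affects temporal credit assignment and the emergence of cooperative behavior.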

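Proposed Method

The model is a partially observable Markov game in which multiple agents learn independently from their own observations and rewards.
Each agent learns its policy with the asynchronous advantage actor-critic (A3C) algorithm using a neural network.
Per-player temporal smoothing of rewards is introduced so that inequity aversion can be applied in the sequential setting as an intrinsic reward.
The Fehr–Schmidt inequity aversion model is extended to Markov games, with separate parameters for disadvantageous and advantageous inequity aversion.
Empirical Schelling diagrams are used to validate two gridworld games (Cleanup and Harvest) as social dilemmas.
Three additional games (Dictate apples, Give apples, Take apples) are examined to illustrate inequity-averse behavior in a simple two-player setting.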
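
The reward shaping described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the standard Fehr–Schmidt penalty form, applied to temporally smoothed rewards that follow an exponentially decaying trace. The function names, the exact smoothing rule, and all parameter values (alpha, beta, gamma, lam) are assumptions chosen for illustration.

```python
from typing import List, Sequence


def smooth_rewards(prev: Sequence[float], rewards: Sequence[float],
                   gamma: float, lam: float) -> List[float]:
    """Update each player's temporally smoothed reward trace.

    Assumed smoothing rule: e_j(t) = gamma * lam * e_j(t-1) + r_j(t).
    """
    return [gamma * lam * e + r for e, r in zip(prev, rewards)]


def inequity_averse_rewards(rewards: Sequence[float], smoothed: Sequence[float],
                            alpha: float, beta: float) -> List[float]:
    """Return each player's subjective reward under Fehr-Schmidt-style penalties.

    alpha weights disadvantageous inequity (others are ahead of me);
    beta weights advantageous inequity (I am ahead of others).
    """
    n = len(rewards)
    if n < 2 or len(smoothed) != n:
        raise ValueError("need at least two players and matching lengths")
    subjective = []
    for i in range(n):
        # Total amount by which other players' smoothed rewards exceed player i's.
        disadvantage = sum(max(smoothed[j] - smoothed[i], 0.0)
                           for j in range(n) if j != i)
        # Total amount by which player i's smoothed reward exceeds the others'.
        advantage = sum(max(smoothed[i] - smoothed[j], 0.0)
                        for j in range(n) if j != i)
        subjective.append(rewards[i]
                          - alpha * disadvantage / (n - 1)
                          - beta * advantage / (n - 1))
    return subjective


# Hypothetical two-player step: player 0 collects a reward, player 1 does not.
traces = smooth_rewards([0.0, 0.0], [1.0, 0.0], gamma=1.0, lam=1.0)
print(inequity_averse_rewards([1.0, 0.0], traces, alpha=2.0, beta=0.5))
```

In this sketch the smoothed traces, rather than the instantaneous rewards, enter the comparison between players. This lets an agent react to inequity that builds up over several time steps, which is what a sequential setting requires.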

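Experimental Results

Research Questions

RQ1: Can inequity-averse preferences be extended from stateless matrix games to sequential, multi-agent Markov games?
RQ2: Do advantageous and disadvantageous inequity aversion promote cooperation in intertemporal social dilemmas, and under what conditions?
RQ3: How does inequity aversion affect temporal credit assignment and learning dynamics in multi-agent reinforcement learning?
RQ4: Do inequity-aversion incentives affect specific environments (public goods dilemmas versus commons dilemmas) differently?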

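Key Findings

Advantageous inequity aversion improves collective outcomes and cooperation in the Cleanup public goods game, and also helps in Harvest, by improving temporal credit assignment.
Disadvantageous inequity aversion promotes cooperation in the Harvest commons game through punishment and the timing of incentives, even when only a single agent has this trait.
Baseline A3C agents fail to achieve socially beneficial outcomes, whereas inequity-averse agents show improvements in social metrics such as cooperation and sustainability in some settings.
Delaying the intrinsic inequity-aversion reward reduces its effect, highlighting the role of timely intrinsic feedback in learning cooperative policies.
The effects are task-dependent: advantageous inequity aversion is particularly effective in the public goods dilemma, while disadvantageous inequity aversion is stronger in the commons dilemma.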

This review was generated by AI and reviewed by human editors.