QUICK REVIEW

[论文解读] Simulating Human Strategic Behavior: Comparing Single and Multi-agent LLMs

Karthik Sreedhar, Lydia Chilton|arXiv (Cornell University)|Feb 13, 2024

Artificial Intelligence in Law被引用 9

一句话总结

本文比较单一与多智能体的 LLM 架构（GPT-3.5/4）在模拟五轮最后通牒博弈策略中，具贪婪与公平性格的人格。多智能体 LLM 更常与人类行为对齐（最高达到 87.5%），而单一 LLM（最低至 42.5%）。

ABSTRACT

When creating policies, plans, or designs for people, it is challenging for designers to foresee all of the ways in which people may reason and behave. Recently, Large Language Models (LLMs) have been shown to be able to simulate human reasoning. We extend this work by measuring LLMs ability to simulate strategic reasoning in the ultimatum game, a classic economics bargaining experiment. Experimental evidence shows human strategic reasoning is complex; people will often choose to punish other players to enforce social norms even at personal expense. We test if LLMs can replicate this behavior in simulation, comparing two structures: single LLMs and multi-agent systems. We compare their abilities to (1) simulate human-like reasoning in the ultimatum game, (2) simulate two player personalities, greedy and fair, and (3) create robust strategies that are logically complete and consistent with personality. Our evaluation shows that multi-agent systems are more accurate than single LLMs (88 percent vs. 50 percent) in simulating human reasoning and actions for personality pairs. Thus, there is potential to use LLMs to simulate human strategic reasoning to help decision and policy-makers perform preliminary explorations of how people behave in systems.

研究动机与目标

激发设计研究者在为人们制定计划、政策或应用时，预想到人类的策略性行为。
评估基于LLM的仿真是否能捕捉经济博弈中的社会规范和策略性推理。
比较单一 vs 多智能体 LLM 架构在生成完整、与人格一致的策略以及在游戏中的表现。
评估策略的鲁棒性并识别仿真中的常见误差源。

提出的方法

使用两种人格类型：贪婪型和公平型，进行五轮最后通牒博弈仿真。
实现两种LLM架构：一个 GPT-4 单一提示用于两名玩家，以及一个多智能体设置，对提议者和接受者分别使用独立的 GPT-4 代理。
提示设计让代理人制定策略，然后按这些策略进行五轮对局。
将输出与从经济学文献中提取的关于提议、接受/拒绝决策的人类基线数据进行比较评估。
分析误差以区分策略生成问题与游戏进行中的偏差。
在四种架构/性格组合下测试 GPT-3.5 与 GPT-4，总计 40 次仿真。

Figure 1. An output log from a SingleLLM simulation of two fair players playing five rounds of the ultimatum game. All text and indentation is from the LLM, bold text added by the authors to highlight strategy and gameplay actions.

实验结果

研究问题

RQ1RQ1：在五轮最后通牒博弈中，哪种架构（单一 vs 多智能体）更准确地模拟人类般的行动？
RQ2RQ2：哪种架构在模拟与两种人格类型（贪婪型和公平型）相对应的行动方面更准确？
RQ3RQ3：哪种架构更常地产生完整且与人格一致的鲁棒策略？

主要发现

多智能体 LLM 与人类行为的一致性更高（最佳设置为 87.5%），高于单智能体 LLM（最低为 42.5%）
在各架构中，多智能体设置更稳定地产生与人类行为一致的行动（MultiAgent-3.5: 82.5%；MultiAgent-4: 87.5%；SingleLLM-3.5: 42.5%；SingleLLM-4: 50%）。
单一LLM仿真中的大多数错误来自策略创建，而非游戏过程中的偏差（例如，策略不完整或与人格不一致）。
在多智能体仿真中，错误主要是策略与人格不一致，但对预期人格的对齐度更高。
鲁棒性分析显示，多智能体配置，尤其是 MultiAgent-4，在更大比例的仿真中产生完整且与人格一致的策略（高达 87.5%）。
研究提供证据表明，多智能体 LLM 架构能够更好地映射复杂的人类策略行为，可能帮助设计者在探索政策或界面效应时。

Figure 2. An output log from a Multi-Agent simulation of two fair players playing five rounds of the ultimatum game. All text is from the LLM; the labels (underlined) are provided by the architecture.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。