QUICK REVIEW

[Paper Review] Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies

Gati Aher, Rosa I. Arriaga|arXiv (Cornell University)|Aug 18, 2022

Topic Modeling | Cited by 123

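One-Sentence Summary

This paper introduces Turing Experiments (TEs) for evaluating how well large language models can simulate multiple human participants and reproduce classic human-subject findings in economics, psycholinguistics, social psychology, and wisdom-of-crowds studies, revealing both faithful simulations and systematic distortions.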

ABSTRACT

We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.

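Motivation and Objectives

Define Turing Experiments (TEs) as zero-shot simulations of multiple human participants in controlled studies.
Present a methodology for running TEs on language models using prompts and generated records.
Reproduce well-known findings from economics, psycholinguistics, and social psychology, and identify distortions in LM simulations.
Evaluate how model size affects fidelity, and expose systematic distortions across domains.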

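Proposed Method

Introduce a zero-shot TE framework in which an LM generates random completions of carefully constructed prompts to simulate one or more participants.
Instantiate diverse simulated subjects through name-based and demographic inputs (titles, surnames, gender markers), and reconstruct transcript-like records of the experiment.
Design and validate prompts to maximize the "validity rate" of completions, and avoid p-hacking by separating hypothesis design from outcome testing.
Apply the TE framework to four classic studies (Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, Wisdom of Crowds), using several GPT-based models and novel variants of the experimental conditions.
Compare the LM-derived results with established human-subject results to assess fidelity and identify distortions, including the hyper-accuracy distortion found in some recent language models.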
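
The prompt-construction and validity-rate steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the name pools, the prompt template, the dollar amounts, and the helper names `build_prompts`, `parse_decision`, and `validity_rate` are all assumptions made for this example, and the call to an actual language model is deliberately left out.

```python
import itertools
import re

# Hypothetical name pools; the paper's actual title and surname lists
# are not reproduced here.
TITLES = ["Mr.", "Ms."]
SURNAMES = ["Garcia", "Nguyen", "Smith"]

# Hypothetical transcript-style template for one Ultimatum Game record.
TEMPLATE = (
    "{proposer} is given $10 and offers {responder} ${offer}. "
    "{responder} must decide whether to accept or reject the offer.\n"
    "{responder} decides to"
)


def build_prompts(offer):
    """Return one prompt per ordered pair of distinct simulated participants."""
    names = [f"{title} {surname}"
             for title, surname in itertools.product(TITLES, SURNAMES)]
    return [
        TEMPLATE.format(proposer=proposer, responder=responder, offer=offer)
        for proposer, responder in itertools.permutations(names, 2)
    ]


def parse_decision(completion):
    """Map a raw completion to 'accept', 'reject', or None if it is invalid."""
    match = re.match(r"\s*(accept|reject)\b", completion, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def validity_rate(completions):
    """Fraction of completions that parse to a valid decision."""
    if not completions:
        return 0.0
    valid = sum(parse_decision(text) is not None for text in completions)
    return valid / len(completions)
```

In this sketch, each prompt would be sent to a language model, the sampled completions passed through `parse_decision`, and `validity_rate` used to compare candidate templates before any hypothesis is tested on the outcomes.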

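Experimental Results

Research Questions

RQ1: To what extent can large language models faithfully simulate representative human behavior in established experiments?
RQ2: Do larger models reproduce the demographic or gender-related effects observed in human studies?
RQ3: Do systematic distortions appear when simulating different domains (economics, linguistics, social psychology, wisdom of crowds)?
RQ4: How do model alignment and training data affect the accuracy of simulations involving numerical knowledge, as in Wisdom of Crowds?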

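Main Findings

Larger models generally simulate the Ultimatum Game, Garden Path, and Milgram TEs more faithfully than smaller models.
In the Ultimatum Game TE, the simulations show gender- and name-related effects consistent with some human findings, including a chivalry-related pattern in which the gender pairing affects acceptance rates.
The Garden Path TE reproduces the basic parsing difficulty humans have with garden path sentences, most clearly in the larger models.
In the Milgram TE, obedience declines as the shocks escalate. The TE uses a novel destructive-obedience scenario rather than the original setup and, as the abstract states, the existing findings were replicated with recent models.
The Wisdom of Crowds TE reveals a hyper-accuracy distortion in recent GPT models (including ChatGPT and GPT-4): simulated individuals give near-perfect estimates of obscure quantities, which points to potential risks in educational or creative applications.
The study distinguishes distortions that may be useful in downstream applications (such as reduced gender bias) from problematic ones (such as overly accurate numerical knowledge).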
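
One way to make the hyper-accuracy distortion concrete is to compare how far individual simulated estimates fall from the true value with how far the crowd's median falls. The sketch below is an assumption-laden illustration, not the paper's metric: the function names and the 1% `tolerance` threshold are hypothetical choices made for this example.

```python
from statistics import median


def relative_errors(estimates, truth):
    """Absolute relative error of each individual estimate."""
    return [abs(estimate - truth) / abs(truth) for estimate in estimates]


def crowd_summary(estimates, truth):
    """Compare typical individual error with the error of the crowd median."""
    crowd_estimate = median(estimates)
    return {
        "median_individual_error": median(relative_errors(estimates, truth)),
        "crowd_error": abs(crowd_estimate - truth) / abs(truth),
    }


def looks_hyper_accurate(estimates, truth, tolerance=0.01):
    """Heuristic flag (hypothetical threshold): the typical individual is
    already within `tolerance` of the truth, so there is little individual
    error left for the crowd to average away."""
    return median(relative_errors(estimates, truth)) <= tolerance
```

In a human-like sample, individuals are noisy while the crowd median is close to the truth. A hyper-accurate simulation shows almost no individual error, which is what `looks_hyper_accurate` is meant to flag.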

This review was generated by AI and reviewed by human editors.