QUICK REVIEW

[Paper Review] Out of One, Many: Using Language Models to Simulate Human Samples

Lisa P. Argyle, Ethan C. Busby|arXiv (Cornell University)|Sep 14, 2022

Computational and Text Analysis Methods | Cited by 92

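One-Sentence Summary

This paper shows that, when conditioned on demographic backstories, GPT-3 can realistically simulate diverse human subpopulations; it introduces silicon sampling and reports strong agreement with human data across several studies of U.S. politics.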

ABSTRACT

We propose and explore the possibility that language models can be studied as effective proxies for specific human sub-populations in social science research. Practical and research applications of artificial intelligence tools have sometimes been limited by problematic biases (such as racism or sexism), which are often treated as uniform properties of the models. We show that the "algorithmic bias" within one such tool -- the GPT-3 language model -- is instead both fine-grained and demographically correlated, meaning that proper conditioning will cause it to accurately emulate response distributions from a wide variety of human subgroups. We term this property "algorithmic fidelity" and explore its extent in GPT-3. We create "silicon samples" by conditioning the model on thousands of socio-demographic backstories from real human participants in multiple large surveys conducted in the United States. We then compare the silicon and human samples to demonstrate that the information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and socio-cultural context that characterize human attitudes. We suggest that language models with sufficient algorithmic fidelity thus constitute a novel and powerful tool to advance understanding of humans and society across a variety of disciplines.

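Research Motivation and Goals

Conceptualize algorithmic fidelity and establish four criteria for evaluating it in language models.
Introduce silicon sampling to correct for the model's demographic skew and to create silicon subjects.
Demonstrate that conditioning GPT-3 on demographic backstories yields human-like responses in the political domain.
Provide evidence that GPT-3 can inform theory generation and testing before, or in the absence of, human data.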

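Proposed Method

Define algorithmic fidelity and four evaluation criteria (Social Science Turing Test, Backward Continuity, Forward Continuity, Pattern Correspondence).
Develop silicon sampling, which conditions the model on known backstories (e.g., those of ANES participants) to adjust for demographic skew in the training data.
Create one silicon subject for each real human participant and have GPT-3 produce responses to the same tasks the humans completed.
Conduct three studies comparing GPT-3 output with human data in the political and opinion domains to assess fidelity across domains.
Use conditioning and ablation analyses to explore robustness and compare models.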
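
The conditioning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Respondent` fields, the first-person sentence wording, and the function names `build_backstory` and `build_prompt` are all assumptions made for the example, and the call to the language model itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    """Hypothetical subset of survey fields used to condition the model."""
    race: str
    gender: str
    age: int
    ideology: str
    party: str


def build_backstory(r: Respondent) -> str:
    """Turn one respondent's survey answers into a first-person backstory.

    The sentence templates are illustrative assumptions, not the paper's
    exact wording.
    """
    sentences = [
        f"Racially, I am {r.race}.",
        f"I am {r.gender}.",
        f"I am {r.age} years old.",
        f"Ideologically, I am {r.ideology}.",
        f"Politically, I am {r.party}.",
    ]
    return " ".join(sentences)


def build_prompt(r: Respondent, question_stem: str) -> str:
    """Append an unfinished sentence for the model to complete."""
    return f"{build_backstory(r)} {question_stem}"


if __name__ == "__main__":
    # Made-up respondent; real silicon samples would draw from survey data.
    person = Respondent("white", "female", 54, "moderate", "an independent")
    print(build_prompt(person, "In the 2016 election, I voted for"))
```

One prompt is built per human participant, so the silicon sample inherits the demographic mix of the survey rather than that of the model's training data.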

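Experimental Results

Research Questions

RQ1: Can GPT-3 generate output that is indistinguishable from human-written text describing partisan political positions (Criterion 1)?
RQ2: Does GPT-3's output reflect the conditioning input and its demographic information (Criterion 2)?
RQ3: Do GPT-3's responses follow consistently from the conditioning context and the expected content (Criterion 3)?
RQ4: Does GPT-3's output reproduce the relationships among ideas, attitudes, and demographics observed in humans (Criterion 4)?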

Key Findings

Year	Tetrachoric correlation	Proportion agreement
2012	0.90	0.85
2016	0.92	0.87
2020	0.94	0.89
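
To make the two columns of the table concrete, here is a small sketch of how such statistics can be computed for paired binary vote choices (human versus silicon). It is a sketch under stated assumptions: the tetrachoric correlation is replaced by the classical "cosine-pi" closed-form approximation rather than the maximum-likelihood estimate, the function names are hypothetical, and the toy votes are invented, so the output is not meant to reproduce the reported figures.

```python
import math


def two_by_two(human, silicon):
    """Count the cells of the 2x2 table for paired 0/1 responses.

    Returns (a, b, c, d): a = both 1, b = human 1 / silicon 0,
    c = human 0 / silicon 1, d = both 0.
    """
    if len(human) != len(silicon) or not human:
        raise ValueError("need two non-empty lists of equal length")
    a = sum(h == 1 and s == 1 for h, s in zip(human, silicon))
    b = sum(h == 1 and s == 0 for h, s in zip(human, silicon))
    c = sum(h == 0 and s == 1 for h, s in zip(human, silicon))
    d = sum(h == 0 and s == 0 for h, s in zip(human, silicon))
    return a, b, c, d


def proportion_agreement(human, silicon):
    """Share of pairs in which the silicon answer equals the human answer."""
    a, b, c, d = two_by_two(human, silicon)
    return (a + d) / (a + b + c + d)


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    This is an approximation, not the maximum-likelihood estimate.
    """
    concordant = a * d
    discordant = b * c
    if concordant == 0 and discordant == 0:
        raise ValueError("correlation is undefined for this table")
    if discordant == 0:
        return 1.0  # limit as the odds ratio grows without bound
    ratio = math.sqrt(concordant / discordant)
    return math.cos(math.pi / (1.0 + ratio))


if __name__ == "__main__":
    # Invented toy data: 1 = one candidate, 0 = the other.
    human = [1, 1, 1, 0, 0, 0, 1, 0]
    silicon = [1, 1, 0, 0, 0, 1, 1, 0]
    print(proportion_agreement(human, silicon))
    print(tetrachoric_cos_pi(*two_by_two(human, silicon)))
```

Proportion agreement counts exact matches, while the tetrachoric correlation treats each binary choice as a thresholded continuous variable, which is why the two columns differ for the same year.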

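In the "stem" study, GPT-3's output was largely indistinguishable from human text on the target task (Turing-test-like evidence).
Evaluation shows that GPT-3's responses are consistent with the attitudes and socio-demographic information in the input (backward continuity).
GPT-3's responses vary with the conditioning in the expected ways and keep a tone and content consistent with the context (forward continuity).
Strong pattern correspondence is observed: GPT-3 reproduces human-like relationships among demographics, attitudes, and behavior, with evidence across multiple years and subgroups.
For 2012, 2016, and 2020, GPT-3's vote choices correlate strongly with ANES vote choice: the tetrachoric correlations are 0.90, 0.92, and 0.94, and proportion agreement is high (0.85, 0.87, and 0.89).
Study 3 shows that GPT-3 reproduces complex associations between attitudes and demographics (similar Cramér's V patterns, with small mean differences).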
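
The Cramér's V comparison mentioned for Study 3 can be illustrated as follows. This is a minimal sketch, assuming the usual definition V = sqrt(chi2 / (n * min(rows - 1, cols - 1))) without bias correction; the function name `cramers_v` and both contingency tables are hypothetical and serve only to show how a human-versus-silicon difference in V would be obtained.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k < 1:
        raise ValueError("table must be at least 2x2 and non-empty")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Invented counts: rows = party identification, columns = vote choice.
    human_table = [[80, 20], [25, 75]]
    silicon_table = [[85, 15], [20, 80]]
    v_human = cramers_v(human_table)
    v_silicon = cramers_v(silicon_table)
    # A small difference means the silicon sample shows a similar
    # strength of association to the human sample for this variable pair.
    print(v_human, v_silicon, v_silicon - v_human)
```

Repeating this for every pair of variables and averaging the differences gives a single summary of how closely the silicon associations track the human ones.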

This review was generated by AI and reviewed by human editors.