QUICK REVIEW

[Paper Review] Can Large Language Models Transform Computational Social Science?

Caleb Ziems, William A. Held|arXiv (Cornell University)|Apr 12, 2023

Topic Modeling | Cited by 61

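One-Sentence Summary

This paper benchmarks the zero-shot performance of 13 LLMs on 25 CSS tasks. It finds that LLMs rarely outperform fine-tuned classifiers, but they reach fair levels of agreement with human annotators and produce useful generations, which suggests a human-AI collaborative CSS workflow.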

ABSTRACT

Large Language Models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the Computational Social Science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers' gold references. We conclude that the performance of today's LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are posed to meaningfully participate in social science analysis in partnership with humans.

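Research Motivation and Goals

Survey the CSS literature to identify tasks where LLMs can help with analysis.
Evaluate the zero-shot performance of multiple LLMs on a representative set of CSS tasks.
Analyze how model size and pretraining affect performance on CSS tasks.
Provide a practical road map for human-AI collaboration in CSS annotation and analysis.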

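Proposed Method

Curate 25 diverse CSS tasks spanning the utterance, conversation, and document levels.
Evaluate 13 language models on these tasks with zero-shot prompting.
Compare zero-shot results against human annotations and, where available, fine-tuned baselines.
Develop prompting best practices and an evaluation pipeline for CSS tasks.
Run generation tasks to evaluate the explanation and reframing abilities of LLMs.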
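
The zero-shot classification setup described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the function names `build_zero_shot_prompt` and `parse_label`, the lettered multiple-choice layout, and the example labels are all assumptions made here for clarity. No model is called; the sketch only shows how a prompt might be assembled and how a short model reply might be mapped back to a label.

```python
import string
from typing import Optional, Sequence


def build_zero_shot_prompt(text: str, question: str, labels: Sequence[str]) -> str:
    """Combine the input text, a question, and lettered answer options.

    The layout (text first, then question, then options) is an assumption
    for illustration, not the paper's exact template.
    """
    options = "\n".join(
        f"{string.ascii_uppercase[i]}: {label}" for i, label in enumerate(labels)
    )
    return f"{text}\n\n{question}\n{options}\n\nAnswer with a single letter."


def parse_label(response: str, labels: Sequence[str]) -> Optional[str]:
    """Map a reply such as 'B' or 'b: conservative' back to a label.

    Only the first non-space character is inspected, so a reply that starts
    with other words (e.g. 'Answer: B') would be misread; returns None when
    the first character is not a valid option letter.
    """
    cleaned = response.strip()
    if not cleaned:
        return None
    index = string.ascii_uppercase.find(cleaned[0].upper())
    if 0 <= index < len(labels):
        return labels[index]
    return None


# Hypothetical example: an ideology classification item with three labels.
labels = ["liberal", "conservative", "neutral"]
prompt = build_zero_shot_prompt(
    "Taxes should be lower.",
    "Which political ideology does this statement reflect?",
    labels,
)
predicted = parse_label("b", labels)  # "conservative"
```

Returning `None` for unparseable replies keeps such cases visible, so they can be counted or routed to a human annotator instead of being silently assigned a label.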

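Experimental Results

Research Questions

RQ1 Feasibility: Can LLMs augment human annotation with reliable labels?
RQ2 Model selection: How do model size and pretraining affect performance on CSS tasks?
RQ3 Domain utility: Do zero-shot LLMs perform better in some CSS domains than in others?
RQ4 Functionality: Are LLMs better suited to labeling (classification) tasks, generation (explanation) tasks, or both?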

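Key Findings

Prompted LLMs generally do not match or exceed carefully fine-tuned classifiers, but they can reach fair levels of agreement with human annotations.
On several tasks, larger models perform better, which points to their potential as assistants rather than replacements.
LLMs can generate explanations that match or exceed dataset references in quality, coherence, and relevance.
Human and LLM outputs are complementary; human evaluators preferred model outputs about half the time.
The proposed hybrid supervised-unsupervised annotation approach can accelerate and improve CSS text analysis.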
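
Agreement between LLM labels and human labels is often quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is one common way to compute it and is offered as an assumption about how such a comparison could be run; this review does not state which statistic the paper reports. The function name `cohens_kappa` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up labels: 3 of 4 items agree.
human = ["pos", "pos", "neg", "neg"]
model = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(human, model)  # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

Under the widely used Landis and Koch convention, kappa values from 0.21 to 0.40 are read as "fair" agreement and 0.41 to 0.60 as "moderate", which gives a rough scale for interpreting model-human agreement.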

This review was generated by AI and reviewed by human editors.