QUICK REVIEW

[Paper Review] Large language models can segment narrative events similarly to humans

Sebastian Michelmann, M. Kumar|arXiv (Cornell University)|Jan 24, 2023

Topic Modeling | Cited by 10

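One-Sentence Summary

GPT-3 can segment continuous narrative text into discrete events whose boundaries align significantly with the human consensus and are, on average, closer to that consensus than the boundaries of individual human annotators.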

ABSTRACT

Humans perceive discrete events such as "restaurant visits" and "train rides" in their continuous experience. One important prerequisite for studying human event perception is the ability of researchers to quantify when one event ends and another begins. Typically, this information is derived by aggregating behavioral annotations from several observers. Here we present an alternative computational approach where event boundaries are derived using a large language model, GPT-3, instead of using human annotations. We demonstrate that GPT-3 can segment continuous narrative text into events. GPT-3-annotated events are significantly correlated with human event annotations. Furthermore, these GPT-derived annotations achieve a good approximation of the "consensus" solution (obtained by averaging across human annotations); the boundaries identified by GPT-3 are closer to the consensus, on average, than boundaries identified by individual human annotators. This finding suggests that GPT-3 provides a feasible solution for automated event annotations, and it demonstrates a further parallel between human cognition and prediction in large language models. In the future, GPT-3 may thereby help to elucidate the principles underlying human event perception.

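Motivation and Goals

Enable scalable, automated event segmentation of naturalistic narratives.
Test whether a large language model can identify event boundaries comparable to human annotations.
Assess how GPT-3-derived boundaries relate to the human consensus and to individual annotations.
Derive a continuous measure of event-boundary probability from the model output and compare it with human agreement.
Release code to support reproducibility and wider use in cognitive science research.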

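Proposed Method

Prompt GPT-3 (text-davinci-002) with a word-for-word prompt to segment each story into events.
Use a sliding window to fit within GPT-3's context length, covering three stories of different lengths.
Extract event boundaries from newline tokens and map them onto the timeline of the word-level transcript using token alignment and dynamic time warping.
Compute a continuous event-boundary probability from the log probability of the newline token and extrapolate it to the human time base.
Compare GPT-3 boundaries with the human consensus using Hamming distance and permutation tests.
Evaluate, for each story, the cross-correlation between GPT-3 boundary probability and the probability of human button presses.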
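
The boundary extraction and comparison steps listed above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' released code. The function names are hypothetical, the Hamming distance is normalized by sequence length, and the null distribution is built from circular shifts of the model's boundary vector, which is an assumption about the permutation scheme. Alignment to audio timestamps via dynamic time warping is left out.

```python
import random


def boundaries_from_segmented(text):
    """Turn newline-segmented text into (words, boundary flags).

    boundary[i] is 1 when an event ends right after word i.
    The end of the final line is not counted as a boundary.
    """
    lines = [ln.split() for ln in text.split("\n")]
    lines = [ln for ln in lines if ln]  # drop empty lines
    words, boundary = [], []
    for k, ln in enumerate(lines):
        for j, word in enumerate(ln):
            words.append(word)
            ends_line = j == len(ln) - 1
            last_line = k == len(lines) - 1
            boundary.append(1 if ends_line and not last_line else 0)
    return words, boundary


def hamming(a, b):
    """Proportion of positions where two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def permutation_p(model, consensus, n_perm=1000, seed=0):
    """One-sided p-value: how often a circularly shifted model vector is
    at least as close to the consensus as the unshifted one.

    Circular shifting is an assumed null model; it preserves the number
    of boundaries and the spacing between them.
    """
    rng = random.Random(seed)
    observed = hamming(model, consensus)
    n = len(model)
    hits = 0
    for _ in range(n_perm):
        k = rng.randrange(1, n)  # non-zero shift
        shifted = model[k:] + model[:k]
        if hamming(shifted, consensus) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy stand-in for a model completion; not data from the paper.
    _, model_vec = boundaries_from_segmented("a b c\nd e\nf")
    consensus_vec = [0, 0, 1, 0, 0, 1]
    print(permutation_p(model_vec, consensus_vec, n_perm=99))
```

A lower distance means closer agreement with the consensus, and the same `hamming` function can score each individual annotator against the consensus for the model-versus-human comparison.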

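Experimental Results

Research Questions

RQ1: Can GPT-3 segment narrative text into discrete events whose boundaries align with human event boundaries?
RQ2: Are GPT-3-derived boundaries closer to the consensus solution than those of individual human annotators?
RQ3: Does the continuous boundary probability derived from the model correlate with human agreement?
RQ4: How does segmenting into longer events affect alignment with the human consensus?
RQ5: How do GPT-3 boundaries compare with human boundaries across multiple stories of different lengths?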

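Main Findings

GPT-3 segmented the three stories into varying numbers of events (e.g., Pieman: 23 events; Monkey in the Middle: 88; Tunnel Under the World: 139), and its boundaries aligned significantly with the human consensus annotations (e.g., Hamming distances of about 0.245–0.264, p < 0.05).
Prompting for long events reduced the number of boundaries (Pieman: 14; Monkey in the Middle: 59; Tunnel Under the World: 76) and brought the segmentation closer to the consensus (lower Hamming distances, some p-values < 0.01).
On average, GPT-3-derived boundaries were closer to the human consensus than individual human annotations were, and the difference was significant in several comparisons (e.g., Pieman, first run: GPT-3 distance 0.261 vs. human 0.281, p = 0.045).
The continuous boundary probability from GPT-3 (the log probability of a newline) correlated significantly with continuous human boundary agreement (the zero-lag correlation peaked at r = 0.362, p < 0.001, for the second Pieman run).
Across stories, GPT-3 boundaries were closer to the consensus solution than those of individual participants, supporting GPT-3 as a scalable tool for automated event-segmentation annotation.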
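
To make the zero-lag correlation reported above concrete, here is a small sketch of a lagged cross-correlation between a model-derived boundary probability and a human agreement curve. It is illustrative only: the function names are hypothetical, both series are assumed to be already resampled onto the same time base, and the newline log probabilities are turned into probabilities with a plain exponential.

```python
import math


def logprobs_to_probs(logprobs):
    """Convert natural-log probabilities of the newline token to probabilities."""
    return [math.exp(lp) for lp in logprobs]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(vx * vy)


def cross_correlation(model, human, max_lag):
    """Return {lag: r}. A positive lag pairs model[t] with human[t + lag],
    i.e. the human signal trails the model signal by `lag` samples.
    """
    if len(model) != len(human):
        raise ValueError("series must share one time base")
    n = len(model)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = model[: n - lag], human[lag:]
        else:
            x, y = model[-lag:], human[: n + lag]
        result[lag] = pearson(x, y)
    return result


if __name__ == "__main__":
    # Toy series, not data from the paper.
    model_prob = [0.1, 0.9, 0.2, 0.8, 0.1, 0.7]
    human_agree = [0.0, 0.1, 0.9, 0.2, 0.8, 0.1]  # model delayed by one sample
    cc = cross_correlation(model_prob, human_agree, max_lag=1)
    print(cc[0], cc[1])
```

The entry at lag 0 corresponds to the zero-lag correlation. Looking at neighboring lags shows whether human responses systematically trail the model signal.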


This review was generated by AI and checked by a human editor.