QUICK REVIEW

[Paper Review] Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark

Michael Reiss|arXiv (Cornell University)|Apr 17, 2023

Artificial Intelligence in Healthcare and Education|Cited by 31

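One-Sentence Summary

This paper analyzes the zero-shot reliability of ChatGPT for text annotation and classification, shows that outputs can be inconsistent across prompts, temperature settings, and repetitions, and recommends caution and validation before any unsupervised use.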

ABSTRACT

Recent studies have demonstrated promising potential of ChatGPT for various text annotation and classification tasks. However, ChatGPT is non-deterministic which means that, as with human coders, identical input can lead to different outputs. Given this, it seems appropriate to test the reliability of ChatGPT. Therefore, this study investigates the consistency of ChatGPT's zero-shot capabilities for text annotation and classification, focusing on different model parameters, prompt variations, and repetitions of identical inputs. Based on the real-world classification task of differentiating website texts into news and not news, results show that consistency in ChatGPT's classification output can fall short of scientific thresholds for reliability. For example, even minor wording alterations in prompts or repeating the identical input can lead to varying outputs. Although pooling outputs from multiple repetitions can improve reliability, this study advises caution when using ChatGPT for zero-shot text annotation and underscores the need for thorough validation, such as comparison against human-annotated data. The unsupervised application of ChatGPT for text annotation and classification is not recommended.

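Research Motivation and Objectives

Evaluate the zero-shot reliability of ChatGPT for text annotation and classification on the real-world task of separating news from not news.
Examine how the model parameter (temperature), prompt variations, and repeated inputs affect consistency.
Assess whether pooling outputs across repetitions can raise reliability to a scientifically acceptable threshold.
Highlight the implications of using ChatGPT in automated annotation software and the need for thorough validation.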

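Proposed Method

Classify 234 German-language website texts as News or Not News using gpt-3.5-turbo via the OpenAI API.
Create ten different instructions (prompt variants), based on the codebook used for manual coding and on shorter alternatives.
Test two temperature settings (0.25 and 1) on 46,800 inputs (2,340 prompts, i.e. 234 texts × 10 instructions, × 10 repetitions × 2 temperatures).
Measure consistency with Krippendorff's Alpha under three conditions: (i) no pooling, (ii) majority vote over three repetitions, (iii) majority vote over ten repetitions.
Compare outputs across different prompts and across repetitions of the same input to assess between-prompt and within-prompt consistency.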
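
The design and the consistency measurement described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: the function names `majority_vote` and `krippendorff_alpha_nominal` are hypothetical, the alpha implementation assumes nominal labels, ties in the vote go to the label seen first (the review does not say how ties were handled), and the toy labels are made up rather than taken from the paper.

```python
from collections import Counter
from itertools import permutations, product

# Experimental grid as summarized above: texts x instructions x temperatures x repetitions.
N_TEXTS, N_INSTRUCTIONS, N_REPETITIONS = 234, 10, 10
TEMPERATURES = (0.25, 1.0)
design = list(product(range(N_TEXTS), range(N_INSTRUCTIONS),
                      TEMPERATURES, range(N_REPETITIONS)))  # 46,800 API calls


def majority_vote(labels):
    """Pool repeated outputs for one input into a single label.

    Assumption: on a tie (possible with an even number of repetitions),
    the label that appears first in `labels` wins.
    """
    return Counter(labels).most_common(1)[0][0]


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    `units` is a list with one inner list per text; each inner list holds the
    labels that the compared "coders" (prompts, temperatures, or runs) assigned.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single label cannot be paired
        for i, j in permutations(range(m), 2):
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)

    totals = Counter()
    for (c, _k), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())

    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category occurs, so alpha is undefined
    return 1 - observed / expected


# Made-up outputs for three texts at two temperature settings, three repetitions each.
runs_low = [["News"] * 3, ["Not News"] * 3, ["News", "News", "Not News"]]
runs_high = [["News", "Not News", "News"],
             ["Not News", "News", "Not News"],
             ["Not News", "News", "Not News"]]

# Pool each setting by majority vote, then compare the two settings per text.
pooled = [[majority_vote(a), majority_vote(b)] for a, b in zip(runs_low, runs_high)]
alpha_pooled = krippendorff_alpha_nominal(pooled)
print(f"calls: {len(design)}, alpha after pooling: {alpha_pooled:.2f}")
```

To measure consistency without pooling, pass the raw single-run labels for each text instead of the majority-voted ones.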

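Experimental Results

Research Questions

RQ1: How consistent are ChatGPT's classifications across different prompts for the same input?
RQ2: How does the temperature setting affect the reliability of ChatGPT's zero-shot annotation?
RQ3: Does pooling the outputs of multiple repetitions improve reliability, and by how much?
RQ4: Is there meaningful consistency when identical input is repeated under the same configuration?
RQ5: What are the implications of using ChatGPT in unsupervised text annotation workflows?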

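Key Findings

Without pooling, agreement between the two temperature settings (Alpha = 0.75) can fall below the reliability threshold.
Pooling ten repetitions of the same prompt at each temperature raises agreement between the temperature settings to Alpha = 0.91.
Different instruction wordings lead to low agreement (Alpha no higher than 0.6), with or without pooling.
Across repetitions of identical input, the lower temperature yields higher consistency (Alpha > 0.9); at the higher temperature, Alpha reaches about 0.85 in the best case.
Overall, zero-shot classification can be unreliable and needs to be validated against human-annotated data; unsupervised use is not recommended.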
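
The recommended validation against human-annotated data can be sketched as a simple agreement check. This is an illustration only: the paper reports Krippendorff's Alpha, whereas the sketch below swaps in Cohen's kappa, a standard chance-corrected agreement measure for two label sequences. The function names and the labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def percent_agreement(model_labels, human_labels):
    """Share of items on which the model and the human coder agree."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


def cohens_kappa(model_labels, human_labels):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human_labels)
    p_observed = percent_agreement(model_labels, human_labels)
    model_freq = Counter(model_labels)
    human_freq = Counter(human_labels)
    # Chance agreement from the two coders' marginal label frequencies.
    p_expected = sum(model_freq[c] * human_freq[c] for c in model_freq) / n ** 2
    if p_expected == 1:
        return float("nan")  # both coders used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels for six texts (illustrative, not data from the paper).
human = ["News", "News", "Not News", "Not News", "News", "Not News"]
model = ["News", "Not News", "Not News", "Not News", "News", "News"]

print(f"agreement: {percent_agreement(model, human):.2f}")
print(f"kappa:     {cohens_kappa(model, human):.2f}")
```

Raw agreement can look acceptable even when much of it is expected by chance, which is why a chance-corrected measure is the more informative check here.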

This review was generated by AI and reviewed by a human editor.