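[Paper Summary] ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models
ArguGPT provides a large, balanced corpus of 4,038 GPT-generated and 4,115 human-written argumentative essays, analyzes the linguistic differences between them, and evaluates detectors including RoBERTa and GPTZero. Detection is strong on in-distribution data overall, but generalization to out-of-distribution data is limited.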
AI generated content (AIGC) presents considerable challenge to educators around the world. Instructors need to be able to detect such text generated by large language models, either with the naked eye or with the help of some tools. There is also growing need to understand the lexical, syntactic and stylistic features of AIGC. To address these challenges in English language teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing tasks. Machine-generated texts are paired with roughly equal number of human-written essays with three score levels matched in essay prompts. We then hire English instructors to distinguish machine essays from human ones. Results show that when first exposed to machine-generated essays, the instructors only have an accuracy of 61% in detecting them. But the number rises to 67% after one round of minimal self-training. Next, we perform linguistic analyses of these essays, which show that machines produce sentences with more complex syntactic structures while human essays tend to be lexically more complex. Finally, we test existing AIGC detectors and build our own detectors using SVMs and RoBERTa. Results suggest that a RoBERTa fine-tuned with the training set of ArguGPT achieves above 90% accuracy in both essay- and sentence-level classification. To the best of our knowledge, this is the first comprehensive analysis of argumentative essays produced by generative large language models. Machine-authored essays in ArguGPT and our models will be made publicly available at https://github.com/huhailinguist/ArguGPT
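Motivation and Objectives
- Establish a baseline for how well educators can identify AI-generated argumentative essays written by GPT models.
- Characterize the linguistic differences between machine-generated and human-written essays, focusing on syntax and vocabulary.
- Evaluate existing AI-generated-content detectors and develop robust detectors using machine learning models.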
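Proposed Method
- Compile the balanced ArguGPT corpus of 4,038 machine-generated and 4,115 human-written essays responding to prompts from WECCL, TOEFL11, and GRE.
- Generate the machine essays by prompting seven GPT models with tuned prompts, then filter the output to remove short, repetitive, or overlapping texts.
- Preprocess the texts for uniformity and evaluate how well human evaluators distinguish machine essays from human ones.
- Analyze 31 syntactic and lexical measures to compare machine and human essays.
- Train and evaluate detectors (SVM and RoBERTa) on in-distribution data and test their generalization on out-of-distribution data.
- Create and evaluate an out-of-distribution dataset of machine and human essays to assess how well the detectors transfer.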
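The post-generation filtering step above (dropping short, repetitive, or overlapping texts) could look roughly like the following. This is a minimal sketch, not the authors' pipeline: the function names, the three thresholds, and the choice of word-trigram Jaccard overlap are all assumptions made for illustration.

```python
import re

# All three thresholds are hypothetical; the summary does not state the actual values.
MIN_WORDS = 50          # essays shorter than this are dropped
MAX_REPEAT_RATIO = 0.3  # maximum share of sentences that repeat an earlier one
MAX_OVERLAP = 0.5       # maximum trigram Jaccard overlap with an already kept essay


def split_sentences(text):
    """Roughly split text into lowercase sentences after ., ! or ?."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip().lower() for p in parts if p.strip()]


def word_trigrams(text):
    """Return the set of lowercase word trigrams in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def repeat_ratio(text):
    """Fraction of sentences that duplicate an earlier sentence."""
    sents = split_sentences(text)
    if not sents:
        return 0.0
    return 1.0 - len(set(sents)) / len(sents)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_essays(essays, min_words=MIN_WORDS,
                  max_repeat=MAX_REPEAT_RATIO, max_overlap=MAX_OVERLAP):
    """Keep essays that are long enough, not repetitive, and not near-duplicates."""
    kept, kept_grams = [], []
    for essay in essays:
        if len(essay.split()) < min_words:
            continue  # too short
        if repeat_ratio(essay) > max_repeat:
            continue  # too many repeated sentences
        grams = word_trigrams(essay)
        if any(jaccard(grams, g) > max_overlap for g in kept_grams):
            continue  # overlaps heavily with an essay already kept
        kept.append(essay)
        kept_grams.append(grams)
    return kept
```

Comparing each candidate against every kept essay is quadratic in corpus size, which is acceptable for a few thousand essays; a larger corpus would likely call for hashing or MinHash instead.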
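Experimental Results
Research Questions
- RQ1: Can ESL instructors distinguish GPT-generated argumentative essays from human-written ones?
- RQ2: Which syntactic and lexical features distinguish machine-generated essays from human-written ones?
- RQ3: Can machine learning classifiers reliably distinguish machine-generated essays from human-written ones, including generalizing across models?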
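Key Findings
- Instructors correctly classified essays as machine or human with 61.6% accuracy in the first round, rising to 67.7% after brief training.
- Linguistically, GPT essays show more complex syntax but are lexically less complex than human essays.
- On in-distribution data, a detector based on RoBERTa fine-tuned on ArguGPT exceeds 90% accuracy at both the essay level and the sentence level (99% essay-level, 93% sentence-level).
- RoBERTa generalizes to unseen models (e.g., claude-instant), whereas off-the-shelf detectors such as GPTZero struggle to generalize to out-of-distribution data.
- ArguGPT and its detectors are publicly available (GitHub and HuggingFace Spaces).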
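The contrast between syntactic and lexical complexity reported above can be made concrete with two crude proxies: mean sentence length (in words) for syntax and type-token ratio for vocabulary. The sketch below is illustrative only. These are not the 31 measures used in the paper, the function names are hypothetical, and type-token ratio is known to be sensitive to text length, so it should only be compared across essays of similar length.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Unique words divided by total words (a rough lexical diversity proxy)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of words per sentence (a rough syntactic proxy)."""
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sents:
        return 0.0
    return sum(len(tokenize(s)) for s in sents) / len(sents)


sample = "The cat sat. The cat ran away quickly."
print(type_token_ratio(sample))      # 6 unique words / 8 words = 0.75
print(mean_sentence_length(sample))  # (3 + 5) / 2 = 4.0
```

Under the paper's finding, one would expect machine essays to score higher on measures of the second kind and lower on measures of the first, though these two proxies alone are far too coarse to confirm that.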
This summary was generated by AI and reviewed by a human editor.