QUICK REVIEW

[Paper Review] AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays

Steffen Herbold, Annette Hautli-Janisz|arXiv (Cornell University)|Apr 24, 2023
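Artificial Intelligence in Healthcare and Education | Cited by 17
One-sentence summary

This study systematically compares human-written and ChatGPT-generated argumentative essays. It finds that ChatGPT (GPT-4 in particular) outperforms humans in overall quality and that each model shows its own distinctive linguistic patterns.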

ABSTRACT

Background: Recently, ChatGPT and similar generative AI models have attracted hundreds of millions of users and become part of the public discourse. Many believe that such models will disrupt society and will result in a significant change in the education system and information generation in the future. So far, this belief is based on either colloquial evidence or benchmarks from the owners of the models -- both lack scientific rigour. Objective: Through a large-scale study comparing human-written versus ChatGPT-generated argumentative student essays, we systematically assess the quality of the AI-generated content. Methods: A large corpus of essays was rated using standard criteria by a large number of human experts (teachers). We augment the analysis with a consideration of the linguistic characteristics of the generated essays. Results: Our results demonstrate that ChatGPT generates essays that are rated higher for quality than human-written essays. The writing style of the AI models exhibits linguistic characteristics that are different from those of the human-written essays, e.g., it is characterized by fewer discourse and epistemic markers, but more nominalizations and greater lexical diversity. Conclusions: Our results clearly demonstrate that models like ChatGPT outperform humans in generating argumentative essays. Since the technology is readily available for anyone to use, educators must act immediately. We must re-invent homework and develop teaching concepts that utilize these AI models in the same way as math utilized the calculator: teach the general concepts first and then use AI tools to free up time for other learning objectives.

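Motivation and objectives

  • Assess the quality of AI-generated argumentative essays against human-written essays using a large pool of expert raters (teachers).
  • Characterize the linguistic differences between human-written essays and essays generated by two ChatGPT versions (GPT-3.5 and GPT-4).
  • Analyze essay quality with statistical rigor, including reliability tests and a correlation analysis with linguistic features.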

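Proposed method

  • Collect a large set of human-written student essays on 90 topics from an online forum.
  • Prompt ChatGPT-3 (the GPT-3.5-based version) and ChatGPT-4 with a basic zero-shot prompt to generate essays of about 200 words on the same topics.
  • Have 108 teachers rate the 270 essays on seven criteria using a seven-point Likert scale, yielding 658 ratings, and compute inter-rater reliability.
  • Run a computational linguistic analysis of lexical diversity, syntactic complexity, nominalizations, modal verbs, epistemic markers, and discourse markers.
  • Use Wilcoxon signed-rank tests with Holm-Bonferroni correction for multiple comparisons, report Cohen's d as the effect size, and use bootstrap-based confidence intervals.
  • Make the analysis reproducible through an available replication package.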
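The statistical steps listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' replication code: the function names, the pooled-standard-deviation form of Cohen's d, the percentile bootstrap, and the example ratings are all assumptions made here for demonstration. The p-values are taken as given, for example from a Wilcoxon test computed elsewhere.

```python
import random
import statistics


def holm_bonferroni(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation (an assumed variant)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


def bootstrap_mean_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of a sample."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


if __name__ == "__main__":
    # Hypothetical Likert ratings, invented purely for illustration.
    ai_ratings = [6, 5, 6, 7, 5, 6, 6]
    human_ratings = [4, 5, 3, 5, 4, 4, 5]
    print("Cohen's d:", cohens_d(ai_ratings, human_ratings))
    print("95% CI of AI mean:", bootstrap_mean_ci(ai_ratings))
    print("Adjusted p-values:", holm_bonferroni([0.01, 0.04, 0.03]))
```

Holm-Bonferroni is a step-down procedure, so it is never more conservative than a plain Bonferroni correction while still controlling the family-wise error rate across the seven criteria.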

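Experimental results

Research questions

  • RQ1: How good is ChatGPT based on GPT-3 and GPT-4 at writing argumentative student essays?
  • RQ2: How do AI-generated essays compare to essays written by humans?
  • RQ3: What linguistic devices are characteristic of human-written versus AI-generated content?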

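Main findings

  • ChatGPT-generated essays receive higher quality ratings than human-written essays on all criteria, and GPT-4 outperforms GPT-3.5.
  • GPT-4 scores higher than GPT-3.5 on logical structure, language complexity, vocabulary richness, and text coherence.
  • Humans use more modal verbs and epistemic markers, whereas the GPT models use more nominalizations and produce more complex sentences.
  • Linguistic diversity improves over time: GPT-4 shows higher diversity than humans, while GPT-3.5 lags behind humans.
  • The differences between GPT-4 and GPT-3.5 are significant for logic, vocabulary and text linking, and complexity, indicating an overall improvement in GPT-4.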
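To make the linguistic comparisons above more concrete, the following sketch computes two crude proxies: a type-token ratio as a stand-in for lexical diversity, and the rate of modal verbs per 100 tokens. Both measures, the regex tokenizer, and the `MODALS` word list are assumptions chosen here for illustration. The study's actual feature extraction may use different, more robust measures.

```python
import re

# A small, illustrative set of English modal verbs (an assumption, not the study's list).
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Number of distinct tokens divided by total tokens (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def marker_rate(text, markers):
    """Occurrences of the given single-word markers per 100 tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in markers)
    return 100.0 * hits / len(tokens)


if __name__ == "__main__":
    essay = "We should act now, and we must learn how these tools work."
    print("Type-token ratio:", type_token_ratio(essay))
    print("Modals per 100 tokens:", marker_rate(essay, MODALS))
```

The type-token ratio is sensitive to text length, so it is only meaningful here because the compared essays are of similar length (about 200 words).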


This review was generated by AI and reviewed by human editors.