QUICK REVIEW

[Paper Review] The Death of the Short-Form Physics Essay in the Coming AI Revolution

Will Yeadon, O. Inyang|arXiv (Cornell University)|Dec 22, 2022

Artificial Intelligence in Healthcare and Education | Cited by 25

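One-Sentence Summary

This paper shows that OpenAI's GPT-3 model can generate submissions of five 300-word physics essays that score about 71% on a Durham University module, indicating that AI-written short-form essays pose a threat to traditional assessment methods.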

ABSTRACT

The latest AI language models can produce original, high-quality short-form ($300$-word) Physics essays within seconds. These technologies, such as ChatGPT and davinci-003, are freely available to anyone with an internet connection. In this work, we present evidence of AI-generated short-form essays achieving first-class grades on an essay writing assessment from an accredited, current university Physics module. The assessment requires students to answer five open-ended questions with a short, $300$-word essay each. Fifty AI answers were generated to create ten submissions that were independently marked by five separate markers. The AI-generated submissions achieved an average mark of $71 \pm 2\%$, in strong agreement with the current module average of $71 \pm 5\%$. A typical AI submission would therefore most likely be awarded a First Class, the highest classification available at UK universities. Plagiarism detection software returned a plagiarism score between $2 \pm 1\%$ (Grammarly) and $7 \pm 2\%$ (Turnitin). We argue that these results indicate that current AI MLPs represent a significant threat to the fidelity of short-form essays as an assessment method in Physics courses.

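Motivation and Objectives

Raise the concern that AI text generation threatens the fidelity of short-form physics essays as an assessment method.
Assess whether AI-generated short-form essays can achieve first-class grades in a real university module.
Characterize AI-generated essays relative to human submissions in terms of consistency and detectability.
Discuss the implications for assessment design in higher education and potential mitigations.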

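Proposed Method

Use the five open-ended physics questions (five 300-word essays) from Durham University's Physics in Society module as the basis of the assessment.
Use the OpenAI davinci-003 playground to generate ten AI-written submissions from the questions (five questions per submission).
Have five independent markers mark the AI submissions, compare the results with the module average, and analyze the Grammarly and Turnitin plagiarism scores.
Present examples of AI output and discuss prompt engineering for obtaining expressive, original answers.
Assess inter-marker agreement and the potential future role of AI as a tutor or feedback provider.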

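Experimental Results

Research Questions

RQ1: Can AI language models generate short-form physics essays that score highly on an accredited university assessment?
RQ2: How do AI-generated essays compare with human students' work in average mark and marking consistency?
RQ3: Can AI-written essays be detected by standard plagiarism detection tools, and what are their characteristics in terms of originality and style?
RQ4: What are the implications of AI capabilities for assessment design and academic integrity in higher education?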

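Key Findings

The ten AI-generated submissions (five questions each) achieved an average mark of 71±2% across the five markers.
This AI average is consistent with the Physics in Society module average (71±5%) and the Durham second-year physics module average (72±3%).
The AI essays were marked consistently across markers, with per-marker averages of 73.0±1.6, 72.6±2.0, 69±2, 70±2, and 70.6±1.9, indicating strong inter-marker agreement.
AI plagiarism scores averaged 2±1% (Grammarly) and 7±2% (Turnitin), indicating that, beyond the wording of the questions themselves, the AI-written text appears sufficiently original to the detection tools commonly used by universities.
The results show that current AI models can reach first-class level on short-form physics essays, challenging the validity of the short-form essay as an assessment method.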
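
The figures quoted in the findings above can be cross-checked with a little arithmetic. The following is a minimal sketch, not the authors' analysis: it assumes the overall AI mark is close to the unweighted mean of the five per-marker averages, and it assumes the quoted ± values can be treated as independent one-standard-deviation uncertainties combined in quadrature. The helper name `sigma_gap` is hypothetical.

```python
import math

# Per-marker mean marks (%) as quoted in the findings above.
marker_means = [73.0, 72.6, 69.0, 70.0, 70.6]

# Assumption: an unweighted mean of the per-marker averages.
overall = sum(marker_means) / len(marker_means)


def sigma_gap(a, err_a, b, err_b):
    """Gap between two values in units of their combined uncertainty.

    Assumption: both errors are independent one-sigma uncertainties,
    so they are combined in quadrature.
    """
    return abs(a - b) / math.hypot(err_a, err_b)


# AI average (71 +/- 2) against the module average (71 +/- 5).
ai_vs_module = sigma_gap(71, 2, 71, 5)

# AI average (71 +/- 2) against the second-year average (72 +/- 3).
ai_vs_year = sigma_gap(71, 2, 72, 3)

print(f"mean of marker averages: {overall:.2f}%")
print(f"AI vs module average:    {ai_vs_module:.2f} sigma")
print(f"AI vs second-year mean:  {ai_vs_year:.2f} sigma")
```

Under these assumptions the per-marker averages reproduce the reported 71% figure, and both comparisons fall well within one combined uncertainty, which is what "consistent" means in the findings above.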


This review was generated by AI and checked by a human editor.