QUICK REVIEW

[Paper Review] Exploring Durham University Physics exams with Large Language Models

Will Yeadon, D. P. Halliday|arXiv (Cornell University)|Jun 27, 2023

Artificial Intelligence in Healthcare and Education|Cited by 8

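One-sentence summary

GPT-4 and GPT-3.5 were evaluated on 42 Durham University physics exam papers (593 questions, 2504 marks) to gauge AI capability and the risk to exam integrity; GPT-4 averaged 49.4% and GPT-3.5 averaged 38.6%, with both scoring lower on the post-COVID exams (47.5% and 33.6%).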

ABSTRACT

The emergence of advanced Natural Language Processing (NLP) models like ChatGPT has raised concerns among universities regarding AI-driven exam completion. This paper provides a comprehensive evaluation of the proficiency of GPT-4 and GPT-3.5 in answering a set of 42 exam papers derived from 10 distinct physics courses, administered at Durham University over the span of 2018 to 2022, totalling 593 questions and 2504 available marks. These exams, spanning both undergraduate and postgraduate levels, include traditional pre-COVID and adaptive COVID-era formats. Questions from the years 2018-2020 were designed for pre-COVID in-person adjudicated examinations, whereas the 2021-2022 exams were set for varying COVID-adapted conditions, including open-book conditions. To ensure a fair evaluation of AI performance, the exams completed by AI were assessed by the original exam markers. However, due to staffing constraints, only the aforementioned 593 out of the total 1280 questions were marked. GPT-4 and GPT-3.5 scored an average of 49.4% and 38.6%, respectively, suggesting only the weaker students would potentially improve their marks if using AI. For exams from the pre-COVID era, the average scores for GPT-4 and GPT-3.5 were 50.8% and 41.6%, respectively. However, post-COVID, these dropped to 47.5% and 33.6%. Thus, contrary to expectations, the change to less fact-based questions in the COVID era did not significantly impact AI performance for state-of-the-art models such as GPT-4. These findings suggest that while current AI models struggle with university-level Physics questions, an improving trend is observable. The code used for automated AI completion is made publicly available for further research.

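Research motivation and objectives

Motivate and quantify the risk of AI-assisted completion of university physics exams.
Evaluate state-of-the-art large language models (GPT-4 and GPT-3.5) on Durham physics exams from 2018–2022.
Provide a transparent, repeatable methodology and open-source tooling to support replication and further research.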

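Proposed method

Automatically extract individual questions from the lecture-style LaTeX source files using regular expressions.
Use GPT-3.5 to clean up and correct LaTeX errors so that the input compiles.
Send each question to the OpenAI API with a system prompt that casts the model as a physics professor, and generate answers in LaTeX format.
Compile the AI output into one PDF per exam, to be marked by the original course markers.
Run iterative LaTeX compilation checks with up to three retries; log compilation failures and question-specific access problems.
Manually verify the extracted questions and answers to confirm the scripts are reliable; share the code on GitHub for reproducibility.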
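
The extraction and retry steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' published code: the `\question` delimiter, the `SAMPLE` snippet, and the `generate` and `compiles` callables (stand-ins for the OpenAI API call and a LaTeX compilation check) are all assumptions made for this example.

```python
import re
from typing import Callable, List, Optional

# Assumption: each question starts with a \question macro inside a
# questions environment. The exam template actually used is not described here.
QUESTION_RE = re.compile(
    r"\\question\b(.*?)(?=\\question\b|\\end\{questions\}|\Z)",
    re.DOTALL,
)

# Hypothetical exam fragment used only to exercise the regex.
SAMPLE = r"""
\begin{questions}
\question State Newton's second law. [2 marks]
\question Derive the period of a simple pendulum. [6 marks]
\end{questions}
"""


def extract_questions(latex_source: str) -> List[str]:
    """Return the body of every \\question block, stripped of whitespace."""
    bodies = QUESTION_RE.findall(latex_source)
    return [body.strip() for body in bodies if body.strip()]


def answer_with_retries(
    question: str,
    generate: Callable[[str], str],   # hypothetical: wraps the model call
    compiles: Callable[[str], bool],  # hypothetical: wraps a LaTeX compile check
    max_retries: int = 3,
) -> Optional[str]:
    """Request an answer, retrying up to max_retries times if it fails to compile.

    Returns None when every attempt fails, so the caller can log the failure.
    """
    for _ in range(1 + max_retries):  # one initial attempt plus the retries
        answer = generate(question)
        if compiles(answer):
            return answer
    return None


if __name__ == "__main__":
    for number, text in enumerate(extract_questions(SAMPLE), start=1):
        print(number, text)
```

Passing the model call and the compile check in as callables keeps the sketch independent of any particular API client or TeX installation; in a real pipeline they would wrap the API request and a compiler invocation respectively.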

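Experimental results

Research questions

RQ1: Do GPT-4 and GPT-3.5 achieve non-trivial scores on Durham University physics exams across multiple courses and levels?
RQ2: How does AI performance differ between the pre-COVID (in-person) and post-COVID (open-book or remotely adapted) exam formats?
RQ3: Does AI performance vary with exam level (Levels 1–4) or course type?
RQ4: Which factors are associated with higher or lower AI scores (e.g. the presence of figures, requests for explanation, mathematical language)?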

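Key findings

GPT-4 averaged 49.4% and GPT-3.5 averaged 38.6% across the 593 questions.
Pre-COVID averages were 50.8% (GPT-4) and 41.6% (GPT-3.5).
Post-COVID averages were 47.5% (GPT-4) and 33.6% (GPT-3.5).
GPT-4 outperformed GPT-3.5 on every exam type, with the results closer for Foundations of Physics 3A and Theoretical Astrophysics.
Excluding zero scores, the average for non-zero attempts rises to 65.6% (GPT-4) and 56.7% (GPT-3.5).
The study provides open-source code for reproducing the experiments and stresses the need for ongoing assessment of AI risk as models improve.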
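
The reported averages can be compared directly with a few lines of arithmetic. The sketch below uses only the figures quoted above; the dictionary layout and function names are illustrative choices for this example, not part of the paper.

```python
# Average scores (%) as quoted in the key findings.
SCORES = {
    "GPT-4": {"overall": 49.4, "pre_covid": 50.8, "post_covid": 47.5},
    "GPT-3.5": {"overall": 38.6, "pre_covid": 41.6, "post_covid": 33.6},
}


def covid_drop(model: str) -> float:
    """Percentage-point drop from the pre-COVID to the post-COVID average."""
    scores = SCORES[model]
    return round(scores["pre_covid"] - scores["post_covid"], 1)


def model_gap(period: str) -> float:
    """Percentage-point lead of GPT-4 over GPT-3.5 for the given period."""
    return round(SCORES["GPT-4"][period] - SCORES["GPT-3.5"][period], 1)


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model}: {covid_drop(model)} point drop post-COVID")
    for period in ("overall", "pre_covid", "post_covid"):
        print(f"{period}: GPT-4 leads by {model_gap(period)} points")
```

On these figures GPT-4 falls by 3.3 points and GPT-3.5 by 8.0 points, so GPT-4's lead widens from 9.2 points pre-COVID to 13.9 points post-COVID. These are differences of reported averages only and say nothing about statistical significance.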


This review was generated by AI and checked by a human editor.