QUICK REVIEW

[Paper Review] Large Language Models in Introductory Programming Education: ChatGPT's Performance and Implications for Assessments

Natalie Kiesler, Daniel Schiffner|arXiv (Cornell University)|Aug 15, 2023

Artificial Intelligence in Healthcare and Education|Cited by 19

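One-Sentence Summary

This paper evaluates ChatGPT-3.5 and GPT-4 on 72 CodingBat Python tasks, finds that 94.4% to 95.8% of the responses are correct and that textual explanations and program code are reliably provided, and then discusses the implications for teaching and assessment.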

ABSTRACT

This paper investigates the performance of the Large Language Models (LLMs) ChatGPT-3.5 and GPT-4 in solving introductory programming tasks. Based on the performance, implications for didactic scenarios and assessment formats utilizing LLMs are derived. For the analysis, 72 Python tasks for novice programmers were selected from the free site CodingBat. Full task descriptions were used as input to the LLMs, while the generated replies were evaluated using CodingBat's unit tests. In addition, the general availability of textual explanations and program code was analyzed. The results show high scores of 94.4 to 95.8% correct responses and reliable availability of textual explanations and program code, which opens new ways to incorporate LLMs into programming education and assessment.

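Research Motivation and Objectives

Evaluate how well ChatGPT-3.5 and GPT-4 generate correct, executable Python code for novice programming tasks.
Analyze whether the outputs contain textual explanations and program code, and how reliably they do so.
Discuss didactic scenarios and assessment formats that make use of large language models in introductory programming education.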

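Proposed Method

Test ChatGPT-3.5 and GPT-4 on 72 CodingBat Python tasks drawn from 8 task areas.
Present the full task description to each model and evaluate the output with CodingBat's unit tests.
Record whether each response contains code, whether it contains an explanation, and whether the code passes the unit tests; iterate on the prompt where needed.
Analyze limiting factors such as task clarity, constraints (no libraries, fixed function signatures), and the overconfidence of language models.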
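
The recording step above can be sketched in Python. This is a minimal illustration, not the authors' tooling: the `ResponseRecord` class, the `passes_all` helper, and the listed test cases are all assumptions made for this example. The sample task follows the idea of CodingBat's Warmup-1 `sleep_in`, but the cases shown here are hand-written and are not the site's actual test suite.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


@dataclass
class ResponseRecord:
    """Hypothetical record of the three indicators tracked per model reply."""
    task: str
    has_explanation: bool
    has_code: bool
    passes_tests: bool


def passes_all(func: Callable[..., Any], cases: Sequence[TestCase]) -> bool:
    """Return True only if func gives the expected result for every case."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test, not as a harness error.
            return False
    return True


# Illustrative cases only (assumption): not copied from CodingBat.
SLEEP_IN_CASES = [
    ((False, False), True),
    ((True, False), False),
    ((False, True), True),
    ((True, True), True),
]


def sleep_in(weekday: bool, vacation: bool) -> bool:
    """Stand-in for a model-generated solution to the sample task."""
    return not weekday or vacation


record = ResponseRecord(
    task="sleep_in",
    has_explanation=True,  # judged by hand in this sketch
    has_code=True,         # judged by hand in this sketch
    passes_tests=passes_all(sleep_in, SLEEP_IN_CASES),
)
print(record)
```

Treating an exception as a failed test keeps one broken reply from stopping the whole run, which matters when many generated solutions are checked in sequence.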

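Experimental Results

Research Questions

RQ1: How correct are the solutions of ChatGPT-3.5 and GPT-4 on introductory programming tasks?
RQ2: To what extent do the models provide textual explanations and program code?
RQ3: What are the implications for didactic design and assessment in introductory programming education?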

Key Findings

CodingBat task area	GPT-3.5 textual explanation	GPT-3.5 program code	GPT-3.5 correct unit test results	GPT-4 textual explanation	GPT-4 program code	GPT-4 correct unit test results
Warmup1	11/12	12/12	12/12	12/12	12/12	12/12
Warmup2	9/9	9/9	9/9	9/9	9/9	9/9
String1	11/11	11/11	10/11	11/11	11/11	11/11
List1	12/12	12/12	12/12	11/12	12/12	12/12
Logic1	8/9	9/9	8/9	9/9	9/9	9/9
Logic2	7/7	7/7	6/7	6/7	7/7	6/7
String2	6/6	6/6	6/6	6/6	6/6	4/6
List2	6/6	6/6	6/6	6/6	6/6	5/6

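ChatGPT-3.5 correctly solved 69 of the 72 tasks (95.8%).
GPT-4 correctly solved 68 of the 72 tasks (94.4%).
Both models provided Python code in their initial responses for all 72 tasks and gave a textual explanation in most cases (GPT-3.5: 70/72; GPT-4: 70/72).
The code usually contained comments and sometimes additional example output.
Correctness was very high across the eight task areas, with only small differences between areas; GPT-4's misses were concentrated in Logic2, String2, and List2.
The authors discuss prompting strategies for improving or refining the output and urge caution because of potential ambiguity and overconfidence.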
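
The headline percentages can be rechecked from the per-area counts in the table. The short sketch below copies the "correct unit test results" columns from the table and sums them; the names `RESULTS`, `totals`, and `percent` are chosen for this example only.

```python
# Per-area counts copied from the table:
# (number of tasks, GPT-3.5 correct, GPT-4 correct)
RESULTS = {
    "Warmup1": (12, 12, 12),
    "Warmup2": (9, 9, 9),
    "String1": (11, 10, 11),
    "List1": (12, 12, 12),
    "Logic1": (9, 8, 9),
    "Logic2": (7, 6, 6),
    "String2": (6, 6, 4),
    "List2": (6, 6, 5),
}


def totals(results):
    """Sum tasks and correct solutions over all task areas."""
    tasks = sum(n for n, _, _ in results.values())
    gpt35 = sum(c for _, c, _ in results.values())
    gpt4 = sum(c for _, _, c in results.values())
    return tasks, gpt35, gpt4


def percent(correct, total):
    """Share of correct solutions, rounded to one decimal place."""
    return round(100 * correct / total, 1)


tasks, gpt35, gpt4 = totals(RESULTS)
print(f"GPT-3.5: {gpt35}/{tasks} = {percent(gpt35, tasks)}%")
print(f"GPT-4:   {gpt4}/{tasks} = {percent(gpt4, tasks)}%")
```

The sums come to 69/72 and 68/72, which match the 95.8% and 94.4% reported above.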


This review was generated by AI and checked by a human editor.