QUICK REVIEW

[Paper Digest] Comparing Code Explanations Created by Students and Large Language Models

Juho Leinonen, Paul Denny | arXiv (Cornell University) | Apr 8, 2023

Software Engineering Research | References: 40 | Cited by: 10

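One-sentence summary

This study compares code explanations written by students in a large CS1 course with explanations generated by GPT-3, and finds that the LLM explanations were rated as more accurate and easier to understand, while being similar in length.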

ABSTRACT

Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high-level of abstraction how code will behave over all possible inputs correlates strongly with code writing skills. However, developing the expertise to comprehend and explain code accurately and succinctly is a challenge for many students. Existing pedagogical approaches that scaffold the ability to explain code, such as producing exemplar code explanations on demand, do not currently scale well to large classrooms. The recent emergence of powerful large language models (LLMs) may offer a solution. In this paper, we explore the potential of LLMs in generating explanations that can serve as examples to scaffold students' ability to understand and explain code. To evaluate LLM-created explanations, we compare them with explanations created by students in a large course ($n \approx 1000$) with respect to accuracy, understandability and length. We find that LLM-created explanations, which can be produced automatically on demand, are rated as being significantly easier to understand and more accurate summaries of code than student-created explanations. We discuss the significance of this finding, and suggest how such models can be incorporated into introductory programming education.

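Research motivation and objectives

Motivate the need to scale code-explanation scaffolding to large CS classrooms.
Investigate whether LLM-generated code explanations can match or exceed student-created explanations in accuracy and understandability.
Examine which aspects of code explanations students find most useful, and how they rate explanations.
Assess whether LLM explanations can serve as scalable exemplars for novices learning to explain code.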

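Proposed method

Use a large first-year course (about 1000 students) to collect explanations of three functions.
Have students write explanations for the three functions (Lab A), then have them rate a random sample of 54 explanations from students and GPT-3 (Lab B).
Rate each explanation on three 5-point Likert-scale questions: ease of understanding, accuracy as a summary, and ideal length.
Compare length by character count to establish a baseline difference.
Apply Mann–Whitney U tests with Bonferroni correction to the differences between sources.
Conduct a thematic analysis of students' open-ended responses to identify the qualities they value in explanations.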
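
The statistical steps listed above (character-count length, Mann–Whitney U tests, Bonferroni correction) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' analysis code: the function names (`char_lengths`, `average_ranks`, `mann_whitney_u`, `bonferroni`) are hypothetical, the ratings and explanation strings are made-up placeholders rather than data from the paper, and the p-value uses a tie-corrected normal approximation without continuity correction, which may differ from whatever statistical software the authors used.

```python
import math


def char_lengths(texts):
    """Length of each explanation, measured in characters."""
    return [len(t) for t in texts]


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample a, approximate p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation identical: no evidence of difference
        return u1, 1.0

    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of standard normal
    return u1, p


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Placeholder 5-point Likert ratings (NOT data from the paper).
student = {
    "understandable": [3, 4, 2, 3, 4, 3, 2, 4],
    "accurate": [3, 3, 4, 2, 3, 4, 3, 3],
    "ideal_length": [3, 2, 3, 4, 3, 3, 2, 3],
}
llm = {
    "understandable": [4, 5, 4, 4, 5, 3, 4, 5],
    "accurate": [4, 4, 5, 4, 3, 5, 4, 4],
    "ideal_length": [3, 3, 2, 3, 4, 3, 3, 3],
}

measures = list(student)
raw_p = [mann_whitney_u(student[m], llm[m])[1] for m in measures]
for measure, p, p_adj in zip(measures, raw_p, bonferroni(raw_p)):
    print(f"{measure}: p={p:.4f}, Bonferroni-adjusted p={p_adj:.4f}")

# Placeholder explanation texts, compared by character count.
student_texts = ["Adds up the numbers in the list.", "Loops and returns a total."]
llm_texts = ["Returns the sum of all values in the input list."]
print(char_lengths(student_texts), char_lengths(llm_texts))
```

A rank-based test is a natural fit here because Likert ratings are ordinal, so comparing rank distributions avoids treating the gaps between scale points as equal. With samples as small as the placeholders above, an exact test would be more appropriate than the normal approximation.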

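Experimental results

Research questions

RQ1: To what extent do code explanations created by students and LLMs differ in accuracy, length, and understandability?
RQ2: What aspects of code explanations do students value?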

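Key findings

LLM-generated explanations were rated as more accurate than student-created explanations.
LLM-generated explanations were rated as easier to understand than student-created explanations.
There was no statistically significant difference between student-created and LLM-generated explanations in either perceived or actual length.
After Bonferroni correction, ideal-length ratings did not differ significantly between sources.
In open-ended responses, students preferred line-by-line explanations and valued explanations that specify inputs and outputs and describe the goal of the code.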


This digest was generated by AI and reviewed by a human editor.