QUICK REVIEW

[Paper Digest] Can We Trust AI-Generated Educational Content? Comparative Analysis of Human and AI-Generated Learning Resources

Paul Denny, Hassan Khosravi|arXiv (Cornell University)|Jun 18, 2023

Online Learning and Analytics | Cited by 18

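ONE-SENTENCE SUMMARY

This study compares AI-generated and student-generated learning resources in an introductory programming course, using blind student evaluation to show that their perceived quality is comparable. AI-generated content closely mirrors the given exemplars and differs from student content in length and syntax usage.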

ABSTRACT

As an increasing number of students move to online learning platforms that deliver personalized learning experiences, there is a great need for the production of high-quality educational content. Large language models (LLMs) appear to offer a promising solution to the rapid creation of learning materials at scale, reducing the burden on instructors. In this study, we investigated the potential for LLMs to produce learning resources in an introductory programming context, by comparing the quality of the resources generated by an LLM with those created by students as part of a learnersourcing activity. Using a blind evaluation, students rated the correctness and helpfulness of resources generated by AI and their peers, after both were initially provided with identical exemplars. Our results show that the quality of AI-generated resources, as perceived by students, is equivalent to the quality of resources generated by their peers. This suggests that AI-generated resources may serve as viable supplementary material in certain contexts. Resources generated by LLMs tend to closely mirror the given exemplars, whereas student-generated resources exhibit greater variety in terms of content length and specific syntax features used. The study highlights the need for further research exploring different types of learning resources and a broader range of subject areas, and understanding the long-term impact of AI-generated resources on learning outcomes.

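MOTIVATION AND OBJECTIVES

Personalized online learning environments create demand for scalable, high-quality educational resources.
Investigate whether AI-generated resources can match student-generated content in correctness and helpfulness.
Assess how AI-created and human-created resources differ in structure and style when both start from the same exemplars.
Provide evidence on the potential role of AI-generated content as supplementary material in computing education.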

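PROPOSED METHOD

Six exemplar learning resources were used to prime both the students and a large language model (Codex) to generate new code examples and explanations.
100 AI-generated code examples were produced via few-shot prompting with varying sets of exemplars.
AI and student resources were placed in a shared repository and evaluated by students in a blind peer review.
Mann-Whitney U tests were applied to quantitative comparisons, and chi-square tests to categorical data.
The comprehensiveness of resources was compared across groups using reserved C keywords, code and explanation length, and keyword usage.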
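The Mann-Whitney U test named in the method can be sketched in plain Python to show what the comparison actually computes. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the p-value uses the normal approximation with a tie correction and no continuity correction, and the ratings at the bottom are made-up values, not data from the paper.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value via normal approximation)."""
    n1, n2 = len(a), len(b)
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: each group of t tied values contributes t^3 - t.
    tie_sizes = {}
    for v in combined:
        tie_sizes[v] = tie_sizes.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())

    n = n1 + n2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical, so no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p_value


# Made-up 1-5 helpfulness ratings, for illustration only.
ai_ratings = [4, 5, 3, 4, 4, 5, 3]
student_ratings = [4, 4, 3, 5, 4, 3, 4]
u, p = mann_whitney_u(ai_ratings, student_ratings)
print(f"U = {u}, p = {p:.3f}")
```

A rank-based test is a natural fit for ordinal ratings such as these, because it compares the ordering of the two groups rather than assuming the scores are normally distributed.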

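RESULTS

Research Questions

RQ1: Given the same priming exemplars, how do student-generated and AI-generated resources differ in overall length and in the presence of syntax features?
RQ2: How do students rate the correctness and helpfulness of resources when they are student-generated versus AI-generated?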
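To make RQ1 concrete, the sketch below shows one way the length and syntax features of a single resource could be measured. It is an assumption-laden illustration: the keyword set is the 32 reserved words of C89 (the digest does not say which list the authors used), the function names are hypothetical, and the tokenizer is naive, so keywords inside comments or string literals are counted as well.

```python
import re
from collections import Counter

# The 32 reserved words of C89; an assumed list, not taken from the paper.
C_KEYWORDS = frozenset("""
auto break case char const continue default do double else enum extern
float for goto if int long register return short signed sizeof static
struct switch typedef union unsigned void volatile while
""".split())

# Matches whole identifiers, so "printf" is one token and never counts as "int".
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def keyword_counts(code):
    """Count how often each reserved keyword appears in a C snippet."""
    return Counter(tok for tok in IDENTIFIER.findall(code) if tok in C_KEYWORDS)


def resource_features(code, explanation):
    """Summarize one resource: length of each part plus keyword usage."""
    counts = keyword_counts(code)
    return {
        "code_chars": len(code),
        "explanation_words": len(explanation.split()),
        "distinct_keywords": len(counts),
        "keyword_counts": counts,
    }


# Hypothetical resource, for illustration only.
snippet = "int main(void) { int i; for (i = 0; i < 3; i++) { } return 0; }"
features = resource_features(snippet, "The loop body runs three times.")
print(features["distinct_keywords"], dict(features["keyword_counts"]))
```

Computing these features per resource and then grouping them by author type (AI or student) yields the distributions that the statistical tests compare.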

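Key Findings

Based on correctness and helpfulness ratings, AI-generated resources were perceived as equivalent in quality to student-generated resources.
AI-generated code tends to be shorter on average than student-generated code, whereas AI explanations are longer than student explanations.
AI-generated resources closely mirror the given exemplars, whereas student-generated resources vary more in length and syntax features.
Differences in reserved keyword usage (e.g., int, while, for, return) among AI, instructor, and student resources are statistically significant, indicating distinct coding patterns across groups.
No statistically significant differences were found between AI-generated and student-generated resources in overall quality decisions or reviewer confidence.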
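The keyword-usage finding rests on a chi-square test of independence. As a rough sketch of that kind of test, the code below checks whether the share of resources containing a given keyword differs across two or three groups. The function name and the counts are hypothetical, not the paper's, and the closed-form p-values used here only hold for 1 or 2 degrees of freedom, which is why the sketch rejects any other group count.

```python
import math


def chi_square_presence(present, totals):
    """Chi-square test on a groups x (present, absent) table.

    present[i] is the number of resources in group i that use the keyword,
    totals[i] is the size of group i. Returns (chi2, df, p_value).
    """
    if len(present) != len(totals) or len(present) not in (2, 3):
        raise ValueError("this sketch handles exactly 2 or 3 groups")
    absent = [t - p for p, t in zip(present, totals)]
    n = sum(totals)
    col_present, col_absent = sum(present), sum(absent)
    if min(totals) <= 0 or col_present == 0 or col_absent == 0:
        raise ValueError("every row and column needs a non-zero total")

    chi2 = 0.0
    for p, a, t in zip(present, absent, totals):
        for observed, col_total in ((p, col_present), (a, col_absent)):
            expected = t * col_total / n  # count expected under independence
            chi2 += (observed - expected) ** 2 / expected

    df = len(present) - 1
    # Exact chi-square survival functions for df = 1 and df = 2.
    p_value = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else math.exp(-chi2 / 2)
    return chi2, df, p_value


# Made-up counts of resources that use `while`, for illustration only:
# groups are AI, instructor exemplars, and students.
chi2, df, p = chi_square_presence(present=[30, 2, 55], totals=[100, 6, 100])
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4f}")
```

With small groups, such as a handful of instructor exemplars, expected counts can fall below the usual rule-of-thumb threshold of 5, in which case an exact test would be the safer choice.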

This digest was generated by AI and reviewed by a human editor.