QUICK REVIEW

[Paper Review] Learning gain differences between ChatGPT and human tutor generated algebra hints

Zachary A. Pardos, Shreya Bhandari|arXiv (Cornell University)|Feb 14, 2023

Intelligent Tutoring Systems and Adaptive Learning | Cited by 72

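One-Sentence Summary

The study compares learning gains from ChatGPT-generated algebra hints with those from human tutor hints. It finds that human hints produce higher gains and that about 30% of ChatGPT hints were rejected for quality problems.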

ABSTRACT

Large Language Models (LLMs), such as ChatGPT, are quickly advancing AI to the frontiers of practical consumer use and leading industries to re-evaluate how they allocate resources for content production. Authoring of open educational resources and hint content within adaptive tutoring systems is labor intensive. Should LLMs like ChatGPT produce educational content on par with human-authored content, the implications would be significant for further scaling of computer tutoring system approaches. In this paper, we conduct the first learning gain evaluation of ChatGPT by comparing the efficacy of its hints with hints authored by human tutors with 77 participants across two algebra topic areas, Elementary Algebra and Intermediate Algebra. We find that 70% of hints produced by ChatGPT passed our manual quality checks and that both human and ChatGPT conditions produced positive learning gains. However, gains were only statistically significant for human tutor created hints. Learning gains from human-created hints were substantially and statistically significantly higher than ChatGPT hints in both topic areas, though ChatGPT participants in the Intermediate Algebra experiment were near ceiling and not even with the control at pre-test. We discuss the limitations of our study and suggest several future directions for the field. Problem and hint content used in the experiment is provided for replicability.

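Motivation and Objectives

Evaluate whether ChatGPT-generated hints can match human tutor hints in promoting learning gains in algebra.
Evaluate the quality and reliability of ChatGPT-generated hints for algebra problems.
Provide replicable content and methods for future work on tutoring hints based on large language models.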

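Proposed Method

A 2x2 experimental design with two hint conditions (ChatGPT vs. human tutor), covering the Elementary Algebra and Intermediate Algebra courses.
ChatGPT hints generated with the December 15, 2022 model, using problems from the OATutor content as the source.
Human tutor hints from OpenStax-derived content serve as the control condition.
Each participant completed three pre-test problems, five acquisition-phase problems, and three post-test problems (the pre-test and post-test used the same problems).
Quality checks: correct answer, correct solution steps, and no inappropriate language. A hint that failed any one check was disqualified.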
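The three-part quality check described above can be sketched as a simple all-or-nothing gate. This is a minimal illustration, not the authors' tooling: the `HintReview` record, its field names, and both helper functions are hypothetical, and the review outcomes are assumed to come from a human rater.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HintReview:
    """Outcome of a manual review of one generated hint (hypothetical record)."""
    answer_correct: bool        # the hint arrives at the correct answer
    steps_correct: bool         # the worked solution steps are correct
    language_appropriate: bool  # the hint contains no inappropriate language


def passes_quality_checks(review: HintReview) -> bool:
    # A hint is kept only if all three checks pass; any single failure disqualifies it.
    return (
        review.answer_correct
        and review.steps_correct
        and review.language_appropriate
    )


def rejection_rate(reviews: List[HintReview]) -> float:
    # Fraction of reviewed hints that fail at least one check.
    if not reviews:
        raise ValueError("no reviews supplied")
    rejected = sum(1 for r in reviews if not passes_quality_checks(r))
    return rejected / len(reviews)
```

Applied to a full set of reviewed hints, `rejection_rate` would give the kind of figure the review reports for RQ1 (about 30% rejected), assuming every generated hint is reviewed exactly once.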

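Experimental Results

Research Questions

RQ1: How often does ChatGPT produce low-quality hints?
RQ2: Do ChatGPT hints produce learning gains?
RQ3: How do ChatGPT hints compare with human tutor hints in terms of learning gains?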

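Key Findings

All conditions produced learning gains, but the gains were statistically significant only in the human hint condition.
In both Elementary and Intermediate Algebra, learning gains from human hints were higher than those from ChatGPT hints.
In Intermediate Algebra, the ChatGPT group was near ceiling at pre-test (about 80%), and its post-test scores did not differ significantly from its pre-test scores. The control (human hint) group's post-test scores differed significantly from pre-test in both topic areas.
30% of ChatGPT hints were rejected for quality problems (wrong answers or incorrect steps).
Time spent was similar across conditions, but the ChatGPT condition required fewer hints because its hints were limited in number and quality-filtered.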
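To make the notion of learning gain concrete, here is a minimal sketch that assumes gain is the simple difference in proportion correct between the three-item pre-test and the three-item post-test. The paper's exact scoring and statistical tests are not restated in this review, so the function names and the default of three items are illustrative assumptions.

```python
from statistics import mean
from typing import List, Tuple


def learning_gain(pre_correct: int, post_correct: int, n_items: int = 3) -> float:
    # Raw gain: change in proportion correct from pre-test to post-test.
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return (post_correct - pre_correct) / n_items


def mean_gain(scores: List[Tuple[int, int]], n_items: int = 3) -> float:
    # Average raw gain over participants, given (pre_correct, post_correct) pairs.
    return mean(learning_gain(pre, post, n_items) for pre, post in scores)
```

Under this definition, a participant who already answers all three pre-test items correctly cannot show a positive gain, which is why a group that starts near ceiling leaves little room for measurable improvement.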

Figure 2. Manually generated hint example


This review was generated by AI and reviewed by human editors.