QUICK REVIEW

[Paper Review] Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

Kathrin Seßler, Arne Bewersdorff | arXiv.org | Feb 18, 2025

Intelligent Tutoring Systems and Adaptive Learning | Cited by 4

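One-Sentence Summary

This study compares the quality of feedback from a large language model (LLM) agent with that of teachers and science education experts on student experimentation protocols. Overall quality is similar, but the LLM falls slightly short when giving feedback on errors in the context of the student's work.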

ABSTRACT

Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback shows no significant difference to that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

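Research Motivation and Objectives

Develop an LLM feedback agent that detects errors in student experimentation protocols and provides adaptive feedback.
Evaluate the quality of LLM-generated feedback against feedback from in-service teachers and science education experts.
Investigate six dimensions of feedback quality (content-related and language-related) using real student data.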

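Proposed Method

Develop an LLM feedback agent with zero-shot prompting that detects errors and provides adaptive feedback in a step-by-step format.
Collect 40 student protocols from 37 students in grades 6–8, containing 109 errors in total.
Collect two human feedback texts per error from 11 teachers and 5 science education experts as a baseline.
Have four blinded raters evaluate the feedback texts on six criteria: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology.
Compare group means with independent-samples t-tests, examine variances, analyze word counts, and compute Spearman correlations between the feedback sources.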
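The statistical comparison described in the last step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the function names (independent_t, average_ranks, spearman) are hypothetical, the ratings are made-up Likert scores, and the t-test shown is the pooled-variance (Student's) form, which is an assumption since the summary does not say which variant was used. Only the test statistic is computed; a p-value would additionally need the t distribution, for example from scipy.stats.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(a, b):
    """Student's independent-samples t statistic (pooled variance) and its df."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = sqrt(sum((p - mx) ** 2 for p in rx))
    sy = sqrt(sum((q - my) ** 2 for q in ry))
    return cov / (sx * sy)


# Made-up five-point Likert ratings for one criterion (illustrative only).
llm_ratings = [4, 3, 4, 5, 3, 4]
teacher_ratings = [4, 4, 5, 3, 4, 5]

t_stat, dof = independent_t(llm_ratings, teacher_ratings)
rho = spearman(llm_ratings, teacher_ratings)
print(f"t = {t_stat:.3f} (df = {dof}), Spearman rho = {rho:.3f}")
```

Rank-based correlation suits Likert data because it relies only on the ordering of the ratings, not on the distance between scale points.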

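Experimental Results

Research Questions

RQ1: Can an LLM-based feedback agent match the quality of teacher and expert feedback on student experimentation protocols?
RQ2: On which dimensions of feedback quality does LLM feedback agree with or differ from human feedback?
RQ3: How do the feedback types from different sources relate to one another in length and in correlation?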

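Key Findings

LLM-generated feedback shows no significant difference from teacher or expert feedback in overall quality.
The Feed Back dimension shows a significant difference: humans outperform the LLM at identifying and explaining errors within the context of the student's work.
LLM feedback generally scores well on the language-related dimensions (tone, clarity, terminology) but lags on content-related feedback, especially contextual error identification.
LLM feedback length clusters around roughly 50 words, similar to teachers, while expert feedback is longer.
Correlations between human and LLM ratings are low on content-related dimensions but higher on language-related ones, suggesting that the sources have different strengths.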

This review was generated by AI and reviewed by a human editor.