QUICK REVIEW

[Paper Review] Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

Kathrin Seßler, Arne Bewersdorff | arXiv.org | Feb 18, 2025

Intelligent Tutoring Systems and Adaptive Learning | Cited by 4

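One-line summary

Summary: The study compares LLM-generated feedback on students' experimentation protocols with feedback from teachers and science education experts. Overall quality is similar, but the LLM lags behind when giving feedback on errors in their context.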

ABSTRACT

Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback shows no significant difference to that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

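Motivation and objectives

Develop an LLM feedback agent that detects errors in students' experimentation protocols and provides adaptive feedback.
Evaluate the quality of LLM-generated feedback by comparing it with feedback from practicing teachers and science education experts.
Investigate six dimensions of feedback quality (content-related and language-related) using real data.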

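Proposed method

Developed an LLM feedback agent that uses zero-shot prompting to detect errors and deliver adaptive feedback in a staged format.
Collected 40 student protocols and 109 errors from 37 students in grades 6–8.
Collected two human feedback texts per error as benchmarks, written by 11 teachers and 5 science education experts.
Four blinded raters rated the feedback texts on six criteria (Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, Technical Terminology).
Compared group means and variances using independent-samples t-tests, analyzed word counts, and computed Spearman correlations between the feedback sources.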
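
The statistical comparison in the last step can be sketched roughly as follows. This is an illustrative example, not the authors' analysis code: the rating lists are made-up toy values, the function names (`welch_t`, `average_ranks`, `spearman`) are hypothetical, and the Welch (unequal-variance) form of the independent-samples t statistic is an assumption, since the review does not say which variant was used. Only the Python standard library is required.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Independent-samples t statistic, Welch form (assumed variant)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )
    return numerator / denominator


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy five-point Likert ratings for the same six errors (invented data).
    llm_ratings = [4, 3, 4, 5, 3, 4]
    teacher_ratings = [4, 4, 5, 5, 3, 5]
    print("t statistic:", welch_t(llm_ratings, teacher_ratings))
    print("Spearman rho:", spearman(llm_ratings, teacher_ratings))
```

The sketch stops at the test statistic. Turning it into a p-value needs the t distribution, which a statistics library such as SciPy provides.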

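Experimental results

Research questions

RQ1: Can an LLM-based feedback agent match the quality of teacher and expert feedback on students' experimentation protocols?
RQ2: In which feedback dimensions do LLMs agree with or diverge from human feedback?
RQ3: What are the length characteristics of the feedback types, and how do the sources correlate with one another?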

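Key findings

LLM-generated feedback showed no significant difference from teacher or expert feedback in overall quality.
A significant difference appeared in the Feed Back dimension: humans outperformed the LLM at identifying and explaining errors in context.
LLM feedback was generally rated highly on the language-related dimensions (Constructive Tone, Linguistic Clarity, Technical Terminology) but lagged on content-related feedback, particularly in identifying contextual errors.
LLM feedback lengths clustered around 50 words, comparable to teachers, while experts wrote longer feedback.
Correlations between human and LLM ratings were low for content-related aspects and high for language-related aspects, suggesting that each source has different strengths.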
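
A length comparison like the one reported above could be computed along these lines. This is a minimal sketch under stated assumptions: the feedback texts are invented placeholders, the names `word_count` and `length_summary` are hypothetical, and words are counted by splitting on whitespace, which suits space-delimited languages only.

```python
from statistics import mean, median


def word_count(text):
    """Count whitespace-separated tokens (assumed definition of a word)."""
    return len(text.split())


def length_summary(feedback_by_source):
    """Summarize feedback length per source as n, mean, median, min, max."""
    summary = {}
    for source, texts in feedback_by_source.items():
        counts = [word_count(t) for t in texts]
        summary[source] = {
            "n": len(counts),
            "mean": mean(counts),
            "median": median(counts),
            "min": min(counts),
            "max": max(counts),
        }
    return summary


if __name__ == "__main__":
    # Invented placeholder texts, not data from the study.
    toy_feedback = {
        "llm": ["State the hypothesis before describing the setup."],
        "teacher": ["Name the variable you changed.", "Add units."],
    }
    for source, stats in length_summary(toy_feedback).items():
        print(source, stats)
```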

This review was generated by AI and checked by a human editor.