QUICK REVIEW

[論文レビュー] Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters

Arne Bewersdorff, Kathrin Seßler|arXiv (Cornell University)|Aug 11, 2023

Online Learning and Analytics被引用数 8

ひとこと要約

要約: 本論文はGPT-3.5/4を用いたAIシステムを開発し、実験プロトコルの学生の誤りを自動的に特定し、人間の評価者と共通のエラーレーティング方式を用いて性能を比較する。

ABSTRACT

Identifying logical errors in complex, incomplete or even contradictory and overall heterogeneous data like students' experimentation protocols is challenging. Recognizing the limitations of current evaluation methods, we investigate the potential of Large Language Models (LLMs) for automatically identifying student errors and streamlining teacher assessments. Our aim is to provide a foundation for productive, personalized feedback. Using a dataset of 65 student protocols, an Artificial Intelligence (AI) system based on the GPT-3.5 and GPT-4 series was developed and tested against human raters. Our results indicate varying levels of accuracy in error detection between the AI system and human raters. The AI system can accurately identify many fundamental student errors, for instance, the AI system identifies when a student is focusing the hypothesis not on the dependent variable but solely on an expected observation (acc. = 0.90), when a student modifies the trials in an ongoing investigation (acc. = 1), and whether a student is conducting valid test trials (acc. = 0.82) reliably. The identification of other, usually more complex errors, like whether a student conducts a valid control trial (acc. = .60), poses a greater challenge. This research explores not only the utility of AI in educational settings, but also contributes to the understanding of the capabilities of LLMs in error detection in inquiry-based learning like experimentation.

研究の動機と目的

LLMベースのAIが実験プロトコルにおける一般的な学生の誤りを有効かつ信頼性高く特定できるかを調査する。
AIのエラー検出を、16種類の共通エラーの事前定義レーティング schemeを使用して人間の評価者とベンチマークする。
複数の指標において、人間同士および人間とAIの間のｲﾝﾀｰレイター信頼性を評価する。

提案手法

Chain-of-Thoughtおよびロール promptingを含む、ゼロショット及びFew-shot promptingを用いたGPT-3.5およびGPT-4でのAIシステムを開発する。
16の共通エラーを検出するため、公開済みのエラーレーティング schemeを基にAIを基礎付ける。
エラーチェック前に、独立変数/従属変数などの重要な実験要素を抽出するためにプロトコルを前処理する。
酵母および円錐実験からの65件のドイツ語学生プロトコルを評価に使用；訓練用25件、ｲﾝﾀｰレータ比較用40件。
三人の人間（R1–R3）と人間とAIの間でAccuracy、Cohen’s Kappa、Fleiss’ Kappa、Gwet’s AC1を用いてｲﾝﾀｰレータ一致度を計算する。

実験結果

リサーチクエスチョン

RQ1LLMベースのAIシステムは、人間の評価者と同等の精度で実験プロトコルにおける一般的な学生の誤りを特定できるか。
RQ2AIが信頼性高く検出するエラータイプと、依然として難しいエラータイプはどれか。
RQ3AIの性能は、AC1、Cohen’s Kappa、Fleiss’ Kappaなど複数の信頼性指標で人間の評価者とどのように比較されるか。

主な発見

AIは特定の基本的なエラーを正確に検出する（例：仮説を従属変数に焦点を当てる；acc = 0.90）。
AIは進行中の調査で試行を変更した場合を信頼性高く識別する（acc = 1）。
AIは学生が有効なテスト試行を実施したかどうかを信頼して判断できる（acc = 0.82）。
より複雑なエラー（例：有効な対照試行）の検出は難しく、正確さは約0.60程度。
三名の人間のｲﾝﾀｰレータ一致は、いくつかのエラーで高い正確さを示す一方、希少/一般的なエラーには大きなばらつきがある。いくつかのエラーでカッパ値が低くなる。
AIと人間の協調一致はエラータイプによって異なり、AIが一部のエラーで人間と同等以上の一致を示す一方、欠落した対照試行や特定のプロトコル記述などでは人間を下回る。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。