QUICK REVIEW

[Paper Review] Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma | arXiv (Cornell University) | May 7, 2024

Machine Learning and Data Classification | Cited by 7

One-sentence summary

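This paper presents an automatic short answer grading (ASAG) system, a fine-tuned open-source transformer trained on exam data from diverse university courses, and shows in a benchmark study that its grades deviate less from the official grades than those of human re-graders.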

ABSTRACT

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams - the historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the hence obtained grades with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders - the model's median absolute error was 44% smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

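Motivation and objectives

Reduce the workload of grading open-ended questions and mitigate human subjectivity and error in higher education.
Develop a scalable ASAG system using a fine-tuned transformer model trained on diverse, multidisciplinary exam data.
Evaluate generalization to unseen questions and courses, and compare AI grading with human re-graders using historical exam data.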

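Proposed method

Predict grades with a large open-source transformer model fine-tuned on a dataset of questions (Q), reference answers (A_ref), maximum points, and student answers (A).
Input/output design: given a tuple (Q, A_ref, x_ref, A), the model outputs a numeric grade.
Split the data into S_train, S_develop, and S_test (with S_test covering both unseen questions and unseen courses) for supervised fine-tuning and evaluation.
Evaluate performance on the held-out test sets, on raw and normalized grades, using regression metrics (MAE, RMSE, Pearson correlation).
Run a human benchmark experiment in which 1,600 questions from 16 courses are re-graded and the deviations from the official grades are compared.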
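
The regression metrics named above can be sketched in a few lines of Python. This is only an illustration, not the authors' evaluation code: the `GradedAnswer` class, its field names, and the `evaluate` helper are hypothetical, and normalizing a grade as a percentage of the question's maximum points is an assumption, since the review does not spell out the normalization.

```python
import math
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One graded item. Field names are illustrative, not from the paper."""
    true_points: float  # grade in the reference data
    pred_points: float  # grade predicted by the model
    max_points: float   # maximum points attainable for the question


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson(y_true, y_pred):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return float("nan")
    return cov / math.sqrt(var_t * var_p)


def evaluate(items, normalize=False):
    """Compute MAE, RMSE, and Pearson correlation on raw or normalized grades."""
    if normalize:
        # Assumption: a normalized grade is the percentage of max points achieved.
        y_true = [100 * it.true_points / it.max_points for it in items]
        y_pred = [100 * it.pred_points / it.max_points for it in items]
    else:
        y_true = [it.true_points for it in items]
        y_pred = [it.pred_points for it in items]
    return {
        "mae": mae(y_true, y_pred),
        "rmse": rmse(y_true, y_pred),
        "pearson": pearson(y_true, y_pred),
    }
```

Calling `evaluate(items)` and `evaluate(items, normalize=True)` on the same list gives the raw and normalized views side by side. Normalization puts questions with different maximum points on a common scale, so a one-point error on a two-point question weighs more than the same error on a twenty-point question.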

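Experimental results

Research questions

RQ1: Can ASAG generalize to unseen questions and unseen courses while maintaining accuracy on par with or better than human graders?
RQ2: In a comparative benchmark, does ASAG deviate less from the official historic grades than human re-graders do?
RQ3: How does question difficulty (maximum points) affect grading accuracy and model performance?
RQ4: How does ASAG perform when evaluated on raw versus normalized grades?
RQ5: Is there a realistic path to integrating AI-based grading into high-stakes educational settings?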

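Key findings

ASAG generalizes well to unseen questions (MAE ~1.32–1.44, RMSE ~2.27–2.41, correlation ~0.69–0.78 across splits).
On normalized grades, performance is stable, with an MAE of about 15.6–18.6 points and a correlation of about 0.61–0.64.
In the human benchmark across 16 courses and 1,600 questions, ASAG agrees more closely with the official grades (RMSE 3.061 points, mean deviation 0.183 points) than the four human re-graders do (RMSE 4.566 points, mean deviation 0.289 points).
The model's deviation from the official grades is smaller than the human re-graders' in 15 of 16 courses; its median absolute deviation is 44% lower than that of the human re-graders.
Performance tends to degrade on questions with high maximum points (18), because such questions are underrepresented in the training data.
The authors propose a four-level framework for autonomous grading and, to manage risk and regulatory concerns, recommend that initial deployment focus on the assistive, correction-support levels (Levels 1-2).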
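
To make the benchmark comparison concrete, the following minimal sketch shows how a median absolute error and a relative reduction between two graders could be computed against the official grades. The function names and the toy numbers in the usage note are hypothetical and are not taken from the paper.

```python
from statistics import median


def median_abs_error(official, regraded):
    """Median of |official - regraded| over all graded answers."""
    return median([abs(o - r) for o, r in zip(official, regraded)])


def relative_reduction(model_err, human_err):
    """Fraction by which the model's error is smaller than the human error."""
    return (human_err - model_err) / human_err


def courses_where_model_closer(courses):
    """Count courses in which the model deviates less from the official grades.

    `courses` maps a course name to a tuple of three equally long lists:
    (official grades, model grades, human re-grader grades).
    """
    return sum(
        1
        for official, model, human in courses.values()
        if median_abs_error(official, model) < median_abs_error(official, human)
    )
```

With toy official grades `[5, 8, 3, 10, 6]`, model grades `[4, 9, 3, 8, 7]`, and human re-grades `[7, 7, 6, 8, 6]`, the median absolute errors are 1 and 2, so `relative_reduction(1, 2)` is 0.5. The 44% figure reported above is the same kind of ratio computed on the real benchmark data.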

This review was generated by AI and checked by a human editor.