QUICK REVIEW

[Paper Review] Evaluating Students' Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large

Jussi S. Jauhiainen, Agustín Garagorry Guerra|arXiv (Cornell University)|May 8, 2024

Intelligent Tutoring Systems and Adaptive Learning|Citations: 5

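One-line summary

Under the RAG framework, this paper uses GPT-3.5, GPT-4, Claude-3, and Mistral-Large to evaluate students' open-ended responses, repeating each evaluation 10 times (which the authors call "10-shot") at each of two temperature settings, for an expected total of 4,320 evaluations across the models.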

ABSTRACT

Evaluating open-ended written examination responses from students is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. In our study, we explore the effectiveness of LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in assessing university students' open-ended answers to questions made about reference material they have studied. Each model was instructed to evaluate 54 answers repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, expecting a total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used as the framework to make the LLMs to process the evaluation of the answers. As of spring 2024, our analysis revealed notable variations in consistency and the grading outcomes provided by studied LLMs. There is a need to comprehend strengths and weaknesses of LLMs in educational settings for evaluating open-ended written responses. Further comparative research is essential to determine the accuracy and cost-effectiveness of using LLMs for educational assessments.

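Motivation and objectives

Assess the consistency and grading outcomes of LLMs when they evaluate students' open-ended responses.
Investigate the effect of temperature settings and repeated evaluation under the RAG framework.
Provide insight into the strengths, weaknesses, and caveats of LLM-based assessment in education.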

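Proposed method

Use the Retrieval Augmented Generation (RAG) framework to process the evaluations.
Evaluate 54 student answers per model under two conditions: temperature 0.0 and temperature 0.5.
Repeat the evaluation 10 times per condition (the paper's "10-shot"), for an expected 1,080 evaluations per model and 4,320 across the four models.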
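
The evaluation counts above follow directly from the design stated in the abstract. A minimal sketch of that arithmetic is shown here; the variable names are illustrative and do not come from the paper.

```python
# Evaluation counts implied by the study design described in the abstract.
# All names here are illustrative, not taken from the paper.
ANSWERS_PER_MODEL = 54
TEMPERATURES = (0.0, 0.5)
REPEATS_PER_CONDITION = 10
MODELS = ("GPT-3.5", "GPT-4", "Claude-3", "Mistral-Large")

# 54 answers x 2 temperature conditions x 10 repetitions
per_model = ANSWERS_PER_MODEL * len(TEMPERATURES) * REPEATS_PER_CONDITION

# The same design applied to each of the four models
total = per_model * len(MODELS)

print(per_model, total)  # prints "1080 4320"
```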

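Experimental results

Research questions

RQ1: How consistent are LLMs when grading or evaluating open-ended responses under RAG?
RQ2: What effect does the temperature setting (0.0 vs. 0.5) have on evaluation results?
RQ3: What are the comparative strengths and weaknesses of GPT-3.5, GPT-4, Claude-3, and Mistral-Large in educational assessment tasks?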

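Key findings

Consistency varies notably across the LLMs studied.
Grading outcomes differ observably between models.
The results highlight the need to understand the strengths and weaknesses of LLMs for evaluating open-ended written responses in education.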
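
This review does not state how the paper measures consistency. As one possible illustration only, the sketch below summarizes how much repeated grades for the same answer vary. The function name, the choice of metrics, and the toy grades are all assumptions made for this example and are not data or methods from the paper.

```python
from statistics import mean, pstdev

def consistency_summary(grades_by_answer):
    """Summarize run-to-run variation of repeated grades.

    grades_by_answer maps an answer id to the list of grades a model
    gave that answer across repeated runs (hypothetical structure).
    """
    # Population standard deviation of the repeated grades, per answer
    spreads = {aid: pstdev(g) for aid, g in grades_by_answer.items()}
    # Number of answers that received the same grade on every run
    identical = sum(1 for g in grades_by_answer.values() if len(set(g)) == 1)
    return {
        "mean_spread": mean(spreads.values()),
        "share_identical": identical / len(grades_by_answer),
    }

# Made-up grades for illustration only; NOT data from the paper.
toy = {"a1": [3, 3, 3, 3], "a2": [2, 4, 2, 4]}
summary = consistency_summary(toy)
print(summary)
```

A lower mean spread and a higher share of identically graded answers would both indicate a more consistent grader under this assumed definition.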

This review was generated by AI and checked by a human editor.