QUICK REVIEW

[Paper Review] Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance

Abdolvahab Khademi|arXiv (Cornell University)|Apr 9, 2023

Topic Modeling | Cited by 13

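One-Sentence Summary

This paper evaluates whether ChatGPT and Bard agree with human raters when rating the complexity of writing prompts, using the intraclass correlation coefficient (ICC) as the reliability metric, and finds that both tools show low inter-rater reliability against the human gold standard.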

ABSTRACT

ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs) that are slated to promise different applications in diverse areas. In education, these AI technologies have been tested for applications in assessment and teaching. In assessment, AI has long been used in automated essay scoring and automated item generation. One psychometric property that these tools must have to assist or replace humans in assessment is high reliability in terms of agreement between AI scores and human raters. In this paper, we measure the reliability of OpenAI ChatGPT and Google Bard LLM tools against experienced and trained humans in perceiving and rating the complexity of writing prompts. Intraclass correlation (ICC) as a performance metric showed that the inter-reliability of both the OpenAI ChatGPT and the Google Bard were low against the gold standard of human ratings.

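Motivation and Objectives

Motivate research into the quality of AI-generated assessment items in education.
Investigate the reliability of AI tools (ChatGPT and Bard) against experienced human raters.
Evaluate how closely these tools match human performance in perceiving and rating the complexity of writing prompts.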

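Proposed Method

Use the intraclass correlation coefficient (ICC) as the primary reliability metric.
Compare OpenAI ChatGPT and Google Bard against human ratings, which are treated as the gold standard.
Have experienced, trained human raters rate the complexity of writing prompts.
Analyze the AI tools' performance relative to human agreement to judge their reliability.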
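
The review does not say which form of ICC the paper used, so the following is only a minimal sketch of how such an agreement score could be computed. It assumes ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater). The function name `icc2_1` and the example ratings are hypothetical and are not taken from the paper.

```python
import numpy as np


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an n x k table: one row per rated item (here, a writing
    prompt) and one column per rater (here, a human or an AI tool).
    Assumption: this ICC form is chosen for illustration only.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()

    # Sums of squares from a two-way ANOVA without replication.
    ss_items = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_items - ss_raters

    # Mean squares for items, raters, and residual error.
    ms_items = ss_items / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_items - ms_error) / (
        ms_items + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )


if __name__ == "__main__":
    # Hypothetical complexity ratings for five prompts (not data from the paper):
    # column 0 = human gold-standard rating, column 1 = AI tool rating.
    human_vs_ai = [
        [1, 3],
        [2, 2],
        [3, 4],
        [4, 2],
        [5, 3],
    ]
    print(f"ICC(2,1) = {icc2_1(human_vs_ai):.3f}")
```

A value near 1 indicates that the AI ratings closely reproduce the human ratings, while a value near 0 indicates little agreement beyond chance. Because this form measures absolute agreement, a tool that consistently rates prompts higher or lower than the humans is penalized even if it ranks the prompts in the same order.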

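Experimental Results

Research Questions

RQ1: Do ChatGPT and Bard achieve a high ICC with human ratings when judging prompt complexity?
RQ2: How do the ICC values of ChatGPT and Bard compare with the human gold standard?
RQ3: For this task, are AI-generated ratings reliable enough to assist or replace human raters?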

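Key Findings

ICC-based reliability against human ratings is low for both ChatGPT and Bard.
The study uses human ratings as the gold standard for rating prompt complexity.
In this context, the results suggest that these LLMs have limited reliability for generating aligned assessment items.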


This review was generated by AI and reviewed by a human editor.