QUICK REVIEW

[Paper Review] Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Dollaya Hirunyasiri, Danielle R. Thomas|arXiv (Cornell University)|Jul 5, 2023

Topic Modeling|Citations: 8

One-line summary

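Using 30 synthetic tutor-student dialogues, this study compares GPT-4, prompted with zero-shot and few-shot chain-of-thought (CoT) prompting, against human graders at identifying five criteria of effective tutor praise.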

ABSTRACT

Research suggests that providing specific and timely feedback to human tutors enhances their performance. However, it presents challenges due to the time-consuming nature of assessing tutor performance by human evaluators. Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. Nevertheless, the accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback. In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a tutor-student setting. We use two different prompting approaches, the zero-shot chain of thought and the few-shot chain of thought, to identify specific components of effective praise based on five criteria. These approaches are then compared to the results of human graders for accuracy. Our goal is to assess the extent to which GPT-4 can accurately identify each praise criterion. We found that both zero-shot and few-shot chain of thought approaches yield comparable results. GPT-4 performs moderately well in identifying instances when the tutor offers specific and immediate praise. However, GPT-4 underperforms in identifying the tutor's ability to deliver sincere praise, particularly in the zero-shot prompting scenario where examples of sincere tutor praise statements were not provided. Future work will focus on enhancing prompt engineering, developing a more general tutoring rubric, and evaluating our method using real-life tutoring dialogues.

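Motivation and objectives

Enable timely, formative feedback that improves tutor performance.
Evaluate whether GPT-4 can accurately identify the components of effective tutor praise.
Compare the reliability and accuracy of zero-shot and few-shot CoT prompting.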

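Proposed method

Generate 30 synthetic tutor-student dialogues with GPT-4.
Have three human graders label each dialogue against a five-criterion praise rubric (Sincere, Specific, Immediate, Authentic, Process-focused).
Take the majority vote as ground truth and compute inter-rater reliability (Fleiss' kappa).
Have GPT-4 identify the praise criteria using zero-shot CoT and few-shot CoT prompting.
Compare GPT-4's results against the human consensus using precision, recall, and F1.
Measure agreement between zero-shot and few-shot prompting with Cohen's kappa.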
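
The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the toy labels are hypothetical, and it assumes each grader gives a binary label (1 = criterion present, 0 = absent) per dialogue for a single criterion.

```python
from collections import Counter

def majority_vote(labels_per_grader):
    """Return the per-dialogue majority label across graders."""
    # zip(*...) regroups the data so each tuple holds every grader's
    # label for one dialogue; with three graders and binary labels
    # there are no ties.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*labels_per_grader)]

def precision_recall_f1(truth, predicted):
    """Score binary predictions against binary ground truth."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy data: three graders, six dialogues, one criterion.
graders = [
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 1],
]
truth = majority_vote(graders)
model = [1, 1, 1, 0, 0, 0]  # hypothetical GPT-4 labels

precision, recall, f1 = precision_recall_f1(truth, model)
print(truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the study this comparison would be repeated once per praise criterion and once per prompting approach.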

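Experimental results

Research questions

RQ1: Can GPT-4 accurately evaluate the components of effective tutor praise that human graders identify?
RQ2: How do zero-shot CoT and few-shot CoT prompting compare in accuracy and reliability on this task?
RQ3: On which praise criteria do GPT-4 and human graders agree most and least?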

Key findings

Praise Criteria	Zero-shot CoT Precision	Zero-shot CoT Recall	Zero-shot CoT F1	Few-shot CoT Precision	Few-shot CoT Recall	Few-shot CoT F1
1-Sincere	0.37	1.00	0.54	0.50	1.00	0.67
2-Specific	0.75	0.92	0.83	0.85	0.85	0.85
3-Immediate	0.75	0.90	0.82	0.72	0.90	0.80
4-Authentic	0.60	1.00	0.75	0.63	0.83	0.71
5-Process-focused	1.00	0.50	0.67	1.00	0.50	0.67
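
F1 is the harmonic mean of precision and recall, so the F1 columns can be recomputed from the other two as a sanity check. The sketch below does this for the zero-shot CoT columns; the helper name `f1_score` is hypothetical, and because the table values are rounded to two decimals, small differences in the last digit are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for zero-shot CoT, copied from the table.
zero_shot = {
    "Sincere": (0.37, 1.00, 0.54),
    "Specific": (0.75, 0.92, 0.83),
    "Immediate": (0.75, 0.90, 0.82),
    "Authentic": (0.60, 1.00, 0.75),
    "Process-focused": (1.00, 0.50, 0.67),
}

for name, (p, r, reported) in zero_shot.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")
```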

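GPT-4 detects specific and immediate praise fairly well under both zero-shot and few-shot prompting (F1 of 0.80 to 0.85 in the table above).
Zero-shot and few-shot CoT prompting perform similarly overall.
GPT-4 struggles to identify sincerity: recall is perfect but precision is low (0.37 to 0.50), giving a low F1 (0.54 zero-shot, 0.67 few-shot), and in some examples the judgment shifts with context.
The two prompting methods agree strongly with each other on the Authentic and Process-focused criteria (Cohen's kappa of about 0.84 to 0.85).
Inter-rater reliability among the human graders varies by criterion, with Fleiss' kappa between 0.29 and 0.69 (fair to substantial agreement on the commonly used Landis and Koch scale).
Human graders outperform GPT-4 on the Sincere and Process-focused criteria, which suggests that the nuance of social and emotional judgment matters.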
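
For readers unfamiliar with the two agreement statistics cited above, here is a minimal pure-Python sketch of the standard textbook formulas. It is not the authors' code: the function names and the toy labels are hypothetical, and it assumes binary labels with the same number of raters for every item.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings, categories=(0, 1)):
    """Chance-corrected agreement among several raters.

    ratings: one list per item, holding every rater's label for that item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_items) / n_items
    # Overall proportion of ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        return 1.0
    return (p_bar - p_expected) / (1 - p_expected)

# Hypothetical toy labels for one criterion.
zero_shot_labels = [1, 1, 0, 0, 1, 0, 1, 0]
few_shot_labels = [1, 1, 0, 0, 1, 0, 0, 0]
human_ratings = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]

print(f"Cohen's kappa:  {cohens_kappa(zero_shot_labels, few_shot_labels):.2f}")
print(f"Fleiss' kappa: {fleiss_kappa(human_ratings):.2f}")
```

Cohen's kappa applies to exactly two raters (here, the two prompting approaches), while Fleiss' kappa generalizes to three or more (here, the human graders).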

This review was generated by AI and checked by a human editor.