QUICK REVIEW

[Paper Review] NERIF: GPT-4V for Automatic Scoring of Drawn Models

Gyeong-Geon Lee, Xiaoming Zhai | arXiv (Cornell University) | Nov 21, 2023

Genetics, Bioinformatics, and Biomedical Research | Cited by 9

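One-Sentence Summary

This paper proposes NERIF, a few-shot learning method that uses GPT-4V's image-processing and language capabilities, with prompts that combine instructional notes and rubrics, to automatically score student-drawn scientific models; it achieves moderate test accuracy with interpretable scores.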

ABSTRACT

Scoring student-drawn models is time-consuming. Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices by leveraging the powerful image processing capability. To test this ability specifically for automatic scoring, we developed a method NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning) employing instructional note and rubrics to prompt GPT-4V to score students' drawn models for science phenomena. We randomly selected a set of balanced data (N = 900) that includes student-drawn models for six modeling assessment tasks. Each model received a score from GPT-4V ranging at three levels: 'Beginning,' 'Developing,' or 'Proficient' according to scoring rubrics. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. Results show that GPT-4V's average scoring accuracy was mean = .51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are more challenging to score. Further qualitative study reveals how GPT-4V retrieves information from image input, including problem context, example evaluations provided by human coders, and students' drawing models. We also uncovered how GPT-4V catches the characteristics of student-drawn models and narrates them in natural language. At last, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. Our findings suggest that the NERIF is an effective approach for employing GPT-4V to score drawn models. Even though there is space for GPT-4V to improve scoring accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study show that utilizing GPT-4V for automatic scoring of student-drawn models is promising.

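Motivation and Objectives

Automatic scoring of student-drawn models is needed in science education to save time and provide timely feedback.
Develop a prompt-based method (NERIF) that uses GPT-4V's image-processing and language capabilities to score drawn models.
Evaluate GPT-4V's performance on six modeling tasks against human expert scores.
Show how instructional notes and scoring rubrics make the scoring results interpretable.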

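Proposed Method

A few-shot learning approach with 9 example evaluations prompts GPT-4V to perform three-class classification (Beginning, Developing, or Proficient).
Each item supplies two attached images: the problem context with scored examples, and the student-drawn model; one example is retrieved at random from the prompt to guide scoring.
Notation-Enhanced Scoring Rubrics are incorporated, with three components: scoring dimensions, proficiency rules, and instructional notes.
A validation stage (N = 54) is used to refine the prompt iteratively; the test scoring stage (N = 900) then runs with greedy decoding (temperature 0, top_p 0.01).
Evaluation uses accuracy, precision, recall, F1, and Fleiss' Kappa; misclassifications are analyzed with confusion matrices.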
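To make the rubric structure more concrete, here is a minimal sketch of how the text portion of such a prompt could be assembled. It is not the authors' prompt: `RubricComponent`, `build_prompt_text`, the wording of the instructions, and the placeholder rubric content are all hypothetical, and the two image attachments and the API call itself are deliberately left out. Only the three level names, the three rubric parts, and the decoding values (temperature 0, top_p 0.01) come from the summary above.

```python
from dataclasses import dataclass

# The three score levels named in the summary.
LEVELS = ("Beginning", "Developing", "Proficient")

# Decoding settings reported for the test stage (greedy decoding).
DECODING = {"temperature": 0, "top_p": 0.01}


@dataclass
class RubricComponent:
    """One rubric entry; the fields mirror the three rubric parts (hypothetical layout)."""
    dimension: str           # what aspect of the drawing is scored
    proficiency_rule: str    # how evidence maps to a level
    instructional_note: str  # extra guidance for the scorer


def build_prompt_text(task_name, components):
    """Assemble only the text part of a scoring prompt; images are handled elsewhere."""
    lines = [
        f"Task: {task_name}",
        "Score the student-drawn model as one of: " + ", ".join(LEVELS) + ".",
        "Rubric:",
    ]
    for i, c in enumerate(components, start=1):
        lines.append(f"{i}. Dimension: {c.dimension}")
        lines.append(f"   Rule: {c.proficiency_rule}")
        lines.append(f"   Note: {c.instructional_note}")
    lines.append("Explain your reasoning for each dimension, then give the final level.")
    return "\n".join(lines)


# Placeholder rubric content, invented for illustration only.
example_components = [
    RubricComponent(
        dimension="Hypothetical dimension A",
        proficiency_rule="Hypothetical rule for dimension A",
        instructional_note="Hypothetical note for dimension A",
    ),
]

prompt_text = build_prompt_text("Hypothetical task", example_components)
```

Keeping the rubric as structured data rather than one long string makes it easier to revise individual notes during an iterative validation stage like the one described above.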

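Experimental Results

Research Questions

RQ1: How accurate is GPT-4V at automatically scoring student-drawn models?
RQ2: How does GPT-4V use the provided scoring rubrics and notes to automatically score student-drawn models?

Key Findings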

Item	Accuracy	Acc_Beg	Acc_Dev	Acc_Prof	Precision	Recall	F1	Kappa
R1-1	0.50	0.50	0.66	0.34	0.56	0.50	0.50	0.44
J2-1	0.45	0.68	0.56	0.12	0.62	0.45	0.41	0.32
M3-1	0.53	0.82	0.40	0.36	0.53	0.53	0.51	0.51
H4-1	0.57	0.64	0.68	0.38	0.61	0.57	0.56	0.51
H5-1	0.47	0.62	0.58	0.22	0.53	0.47	0.46	0.43
J6-1	0.53	0.62	0.84	0.12	0.62	0.53	0.48	0.38
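
The per-item columns in the table can all be derived from a 3x3 confusion matrix of human scores against GPT-4V scores. The sketch below shows one way to do that in plain Python. The counts are illustrative: the diagonal is chosen to reproduce the per-class accuracies of item R1-1 under the assumption of 50 drawings per class, while the off-diagonal counts are invented, so the resulting precision and F1 are not the paper's values. Support-weighted averaging is also an assumption; it is at least consistent with the table, where the Recall column equals the Accuracy column in every row.

```python
LEVELS = ["Beginning", "Developing", "Proficient"]

# Rows = human score, columns = GPT-4V score.
# Diagonal matches R1-1 per-class accuracy assuming 50 drawings per class;
# off-diagonal counts are invented for illustration.
confusion = [
    [25, 20, 5],
    [10, 33, 7],
    [8, 25, 17],
]


def per_class_accuracy(cm):
    """Fraction of each human-assigned class that the model scored the same way."""
    return [cm[k][k] / sum(cm[k]) for k in range(len(cm))]


def overall_accuracy(cm):
    """Fraction of all drawings on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def weighted_prf(cm):
    """Support-weighted precision, recall, and F1 over the classes."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = f1 = 0.0
    for k in range(n):
        support = sum(cm[k])                         # drawings humans put in class k
        predicted = sum(cm[r][k] for r in range(n))  # drawings the model put in class k
        p = cm[k][k] / predicted if predicted else 0.0
        r = cm[k][k] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


if __name__ == "__main__":
    for level, acc in zip(LEVELS, per_class_accuracy(confusion)):
        print(f"{level}: {acc:.2f}")
    print("accuracy:", round(overall_accuracy(confusion), 2))
    print("weighted P/R/F1:", [round(x, 2) for x in weighted_prf(confusion)])
```

With balanced classes, overall accuracy is simply the mean of the three per-class accuracies, which is why a low Proficient score pulls every item's accuracy down.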

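The mean test scoring accuracy across the six tasks was 0.51 (SD = 0.037).
Mean precision, recall, and F1 across the six tasks were 0.58, 0.51, and 0.49, respectively; Fleiss' Kappa ranged from 0.32 to 0.51 (fair to moderate agreement).
Accuracy by class: Beginning 0.64, Developing 0.62, Proficient 0.26, indicating that Proficient drawings are harder for GPT-4V to score.
Mean validation accuracy across the six tasks was 0.67 (Beginning 0.78, Developing 0.67, Proficient 0.56).
GPT-4V can retrieve the problem context and scored examples from the input images and generate natural-language reasoning for each scoring component.
The results suggest that adding example demonstrations (few-shot prompting) and instructional notes improves scoring quality.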
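
The aggregate figures quoted above can be checked against the per-item table with a few lines of Python. The sketch below recomputes the column means from the rounded table values, so the last digit can differ slightly from the reported figures (the Beginning mean comes out near 0.647 rather than the reported 0.64). The `kappa_band` helper uses the common Landis and Koch cut-offs; that the summary's "fair to moderate" wording follows that convention is an assumption.

```python
from statistics import mean

COLUMNS = ("Accuracy", "Acc_Beg", "Acc_Dev", "Acc_Prof",
           "Precision", "Recall", "F1", "Kappa")

# Values copied from the per-item results table.
ROWS = {
    "R1-1": (0.50, 0.50, 0.66, 0.34, 0.56, 0.50, 0.50, 0.44),
    "J2-1": (0.45, 0.68, 0.56, 0.12, 0.62, 0.45, 0.41, 0.32),
    "M3-1": (0.53, 0.82, 0.40, 0.36, 0.53, 0.53, 0.51, 0.51),
    "H4-1": (0.57, 0.64, 0.68, 0.38, 0.61, 0.57, 0.56, 0.51),
    "H5-1": (0.47, 0.62, 0.58, 0.22, 0.53, 0.47, 0.46, 0.43),
    "J6-1": (0.53, 0.62, 0.84, 0.12, 0.62, 0.53, 0.48, 0.38),
}


def column_means(rows):
    """Mean of each table column across all items."""
    return {name: mean(values[i] for values in rows.values())
            for i, name in enumerate(COLUMNS)}


def kappa_band(kappa):
    """Label a kappa value using the Landis and Koch bands (assumed convention)."""
    if kappa <= 0.20:
        return "slight or poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


means = column_means(ROWS)
kappas = [values[COLUMNS.index("Kappa")] for values in ROWS.values()]

if __name__ == "__main__":
    for name, value in means.items():
        print(f"{name}: {value:.3f}")
    print("Kappa range:", min(kappas), "to", max(kappas),
          f"({kappa_band(min(kappas))} to {kappa_band(max(kappas))})")
```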

This review was generated by AI and reviewed by human editors.