QUICK REVIEW

[Paper Review] Using GPT-4 to Augment Unbalanced Data for Automatic Scoring

Luyang Fang, Gyeong-Geon Lee|arXiv (Cornell University)|Oct 25, 2023

Topic Modeling | Cited by 9

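One-Sentence Summary

The paper uses GPT-4 to generate minority-class student responses to balance unbalanced datasets, then fine-tunes DistilBERT for automatic scoring. Accuracy, precision, recall, and F1 all improve over training on non-augmented data and match or exceed gold-standard augmentation (additional student-written responses).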

ABSTRACT

Machine learning-based automatic scoring faces challenges with unbalanced student responses across scoring categories. To address this, we introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model, specifically tailored for unbalanced datasets in automatic scoring. Our experimental dataset comprised student written responses to four science items. We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes, enhancing the dataset. We then finetuned DistilBERT for automatic scoring based on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and F1 metrics. Our findings revealed that incorporating GPT-4-augmented data remarkably improved model performance, particularly for precision and F1 scores. Interestingly, the extent of improvement varied depending on the specific dataset and the proportion of augmented data used. Notably, we found that a varying amount of augmented data (20%-40%) was needed to obtain stable improvement for automatic scoring. Comparisons with models trained on additional student-written responses suggest that GPT-4 augmented models match those trained with student data. This research underscores the potential and effectiveness of data augmentation techniques utilizing generative large language models like GPT-4 in addressing unbalanced datasets within automated assessment.

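Research Motivation and Objectives

Address the unbalanced distribution of responses across scoring categories in the automatic scoring of students' scientific explanations.
Explore prompt-based GPT-4 augmentation to strengthen minority scoring categories.
Evaluate scoring performance by comparing augmented data against the original data and against gold-standard augmentation.
Assess how the augmentation ratio affects model metrics and their stability.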

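Proposed Method

Build datasets from two science items (Q1 and Q2) whose minority classes are highly unbalanced.
Generate GPT-4 augmented responses for the minority classes to balance the data.
Fine-tune DistilBERT for automatic scoring on the augmented and the original datasets.
Split the data into training, validation, and test sets, with the minority class given increased representation in the test set.
Evaluate accuracy, precision, recall, and F1 at augmentation ratios from 0 to 100%.
Compare GPT-4 augmented data with gold-standard augmentation (additional real student responses).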
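
The ratio sweep in the steps above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the review does not define how the augmentation ratio is computed, so the sketch assumes it is the share of a pre-generated pool of GPT-4 minority-class responses that gets added to the original training set. The function name `mix_augmented`, the toy data, and the class sizes are all hypothetical.

```python
import random
from collections import Counter


def mix_augmented(original, generated, ratio, seed=0):
    """Return the original samples plus a `ratio` share of the generated pool.

    ASSUMPTION: "augmentation ratio" means the fraction of the generated
    pool that is added; the source does not spell out the definition.
    original, generated: lists of (text, label) pairs.
    ratio: float in [0, 1]; 0 keeps only original data, 1 adds the full pool.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    k = round(ratio * len(generated))
    return original + rng.sample(generated, k)


# Hypothetical toy data: label 1 is the minority scoring class.
original = [(f"student response {i}", 0) for i in range(90)]
original += [(f"student response {i}", 1) for i in range(90, 100)]
generated = [(f"generated response {i}", 1) for i in range(80)]

for ratio in (0.0, 0.05, 0.2, 0.4, 1.0):
    train = mix_augmented(original, generated, ratio)
    counts = Counter(label for _, label in train)
    print(f"ratio={ratio:.2f} size={len(train)} minority={counts[1]}")
```

Each resulting training set would then be used to fine-tune a separate scoring model, so that metrics can be compared across ratios on the same held-out test set.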

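Experimental Results

Research Questions

RQ1 To what extent does GPT-4 augmented training data improve scoring performance?
RQ2 How efficient is GPT-4-based data augmentation at improving the performance of scoring models?
RQ3 How does GPT-4-based data augmentation compare with using additional student-written responses?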

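Key Findings

GPT-4 augmentation improves accuracy, precision, recall, and F1; averaged over the two items, the maximum improvements are 3.5% (accuracy), 30.6% (precision), 21.1% (recall), and 24.2% (F1).
Using only 5% augmented data already yields a marked improvement: on average, 2.6% in accuracy, 29.2% in precision, 15.1% in recall, and 19.6% in F1.
Improvements are task-dependent: the gain depends on the characteristics of the dataset and the level of augmentation.
Models trained on GPT-4-augmented data generally match or exceed models trained on data augmented with student-written responses, with differences in favor of GPT-4 augmentation of about 1.7% in accuracy, 1.9% in precision, 11.0% in recall, and 7.8% in F1.
For Task 1, precision, recall, and F1 improve markedly with the initial augmentation and stabilize after 5% to 20% augmentation.
For Task 2, accuracy stays high with augmented data (a ceiling effect), while recall and F1 keep rising as more augmented data is added and saturate at around 40%.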
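
The gap between small accuracy gains and large precision, recall, and F1 gains is what one would expect on unbalanced data, and a short calculation may make that concrete. The sketch below is illustrative only: the label vectors are invented, not taken from the paper, and it assumes metrics are computed for the minority class as the positive class, which the review does not state. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# HYPOTHETICAL test set: 90 majority-class (0) and 10 minority-class (1) responses.
y_true = [0] * 90 + [1] * 10

# Invented predictions: a baseline that finds 2 of 10 minority responses,
# and an augmented model that finds 8 of 10 at the cost of 2 false positives.
baseline_pred = [0] * 90 + [1] * 2 + [0] * 8
augmented_pred = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2

baseline = binary_metrics(y_true, baseline_pred)
augmented = binary_metrics(y_true, augmented_pred)
for name in ("accuracy", "precision", "recall", "f1"):
    print(f"{name}: {baseline[name]:.3f} -> {augmented[name]:.3f}")
```

In this toy case accuracy moves only from 0.92 to 0.96 because the majority class dominates the count, while recall on the minority class moves from 0.2 to 0.8. That is the same pattern as the ceiling effect described for Task 2.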


This review was generated by AI and reviewed by a human editor.