QUICK REVIEW

[Paper Review] The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks

Anders Giovanni Møller, Jacob Aarup Dalsgaard|arXiv (Cornell University)|Apr 26, 2023

Machine Learning and Data Classification | Cited by 26

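ONE-SENTENCE SUMMARY

This paper compares human-labeled data with data augmented by GPT-4 and Llama-2 across ten CSS classification tasks: models trained on human-labeled data usually perform better, but LLM augmentation helps on rare classes and complex tasks, and zero-shot LLMs fall short of models trained on labeled data in most cases.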

ABSTRACT

In the realm of Computational Social Science (CSS), practitioners often navigate complex, low-resource domains and face the costly and time-intensive challenges of acquiring and annotating data. We aim to establish a set of guidelines to address such challenges, comparing the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks of varying complexity. Additionally, we examine the impact of training data sizes on performance. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts. Nevertheless, synthetic augmentation proves beneficial, particularly in improving performance on rare classes within multi-class tasks. Furthermore, we leverage GPT-4 and Llama-2 for zero-shot classification and find that, while they generally display strong performance, they often fall short when compared to specialized classifiers trained on moderately sized training sets.

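MOTIVATION AND OBJECTIVES

Give Computational Social Science practitioners actionable guidelines on when to rely on human annotation and when to use LLM-generated augmentation in classification tasks.
Evaluate models trained on human-labeled data against models trained on LLM-augmented data across tasks of varying complexity and class balance.
Evaluate the zero-shot performance of GPT-4 and Llama-2 against supervised models trained on the different data sources.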

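PROPOSED METHOD

Simulate low-resource annotation by taking 10% of each task's data as a base crowdsourced dataset.
Augment the base set either with additional human-labeled data or with synthetic samples generated by GPT-4 or Llama-2 70B Chat (9 synthetic samples per base sample).
Train a fixed architecture (intfloat/e5-base, 110M parameters) for 10 epochs with the AdamW optimizer; evaluate on a held-out test set using macro F1 and accuracy.
Balanced augmentation: oversample the minority classes in the base set before synthetic augmentation to address class imbalance.
Compare the trained models with zero-shot classification by GPT-4 and Llama-2 70B Chat, using the same prompt across all tasks.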
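The data-preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the base set is drawn by simple random sampling, minority classes are oversampled up to the majority-class count (the summary does not state the exact target), and `placeholder_generate` stands in for the GPT-4 or Llama-2 call, which is not reproduced here.

```python
import random
from collections import Counter

SYNTHETIC_PER_BASE = 9  # from the summary: 9 synthetic samples per base sample


def sample_base_set(examples, fraction=0.10, seed=0):
    """Draw a random `fraction` of (text, label) pairs as the low-resource base set."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)


def oversample_minority(base, seed=0):
    """Duplicate minority-class examples until every class present in `base`
    matches the majority-class count (assumed target, see lead-in)."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in base)
    target = max(counts.values())
    balanced = list(base)
    for label, n in counts.items():
        pool = [ex for ex in base if ex[1] == label]
        balanced.extend(rng.choices(pool, k=target - n))
    return balanced


def augment(base, generate, n_per_sample=SYNTHETIC_PER_BASE):
    """Create `n_per_sample` synthetic examples per base example, keeping its label."""
    synthetic = []
    for text, label in base:
        for new_text in generate(text, label, n_per_sample):
            synthetic.append((new_text, label))
    return synthetic


def placeholder_generate(text, label, n):
    """Hypothetical stand-in for an LLM call; returns tagged copies, not real paraphrases."""
    return [f"[synthetic {i} | {label}] {text}" for i in range(n)]


if __name__ == "__main__":
    # Toy imbalanced dataset, invented for illustration only.
    data = [(f"text {i}", "common") for i in range(90)]
    data += [(f"text {i}", "rare") for i in range(90, 100)]

    base = sample_base_set(data)            # 10% base set
    balanced = oversample_minority(base)    # balanced-augmentation variant
    synthetic = augment(balanced, placeholder_generate)
    print(len(base), Counter(label for _, label in balanced), len(synthetic))
```

In the plain (unbalanced) variant, `augment` would be applied to `base` directly, so the rare classes stay as rare in the synthetic data as they were in the base sample.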

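EXPERIMENTAL RESULTS

Research Questions

RQ1: How does the performance of models trained on human-labeled data compare with that of models trained on LLM-generated augmentations across tasks of varying complexity?
RQ2: Compared with crowdsourced data, does LLM-generated augmentation improve performance on rare classes in multi-class tasks?
RQ3: Across the ten CSS classification tasks, how does zero-shot LLM performance compare with that of models trained on labeled data?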

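Key Findings

Models trained on human-labeled data usually outperform synthetically augmented and zero-shot models on balanced binary tasks and on some balanced multi-class tasks.
LLM augmentation helps mainly on complex, imbalanced multi-class tasks and on rare classes, where it sometimes even surpasses crowdsourced data.
Zero-shot performance is task-dependent and is usually surpassed by models trained on moderately sized labeled or synthetically augmented datasets; GPT-4 and Llama-2 show different strengths across tasks.
Llama-2's synthetic data may be lexically more diverse than GPT-4's, which affects performance on some tasks (e.g., sentiment).
Synthetic augmentation may be especially valuable when real samples of rare classes are hard to obtain.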
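Rare-class gains show up in macro F1 rather than accuracy, since macro F1 averages the per-class F1 scores with equal weight. The sketch below illustrates this with invented toy labels (not results from the paper): two hypothetical classifiers with identical accuracy differ sharply in macro F1 depending on whether they recover the rare class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold labels: 18 common examples, 2 rare ones (invented for illustration).
y_true = ["common"] * 18 + ["rare"] * 2

# Hypothetical classifier A ignores the rare class entirely.
pred_a = ["common"] * 20

# Hypothetical classifier B finds both rare examples but mislabels 2 common ones.
pred_b = ["common"] * 16 + ["rare"] * 2 + ["rare"] * 2

if __name__ == "__main__":
    for name, pred in [("A", pred_a), ("B", pred_b)]:
        print(name, round(accuracy(y_true, pred), 3), round(macro_f1(y_true, pred), 3))
```

Both classifiers make two errors out of twenty, so accuracy cannot separate them; macro F1 does, because classifier A scores zero on the rare class.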


This review was generated by AI and checked by human editors.