QUICK REVIEW

[Paper Review] ChatGPT: Jack of all trades, master of none

Jan Kocoń, Igor Cichecki|arXiv (Cornell University)|Feb 21, 2023

Topic Modeling | Cited by 15

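One-Sentence Summary

This paper automatically evaluates ChatGPT on 25 diverse NLP tasks (semantic and pragmatic) and GPT-4 on five selected subsets, compares the results with SOTA, studies personalization, and reveals bias.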

ABSTRACT

OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.

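Research Motivation and Objectives

Evaluate ChatGPT and GPT-4 on a broad range of analytical NLP tasks, covering both subjective and objective problems.
Compare the ChatGPT and GPT-4 results against state-of-the-art (SOTA) baselines.
Study how personalization and context affect predictions.
Analyze potential bias and ethical considerations in model outputs.
Offer insights into evaluation methodology and the usefulness of large language models to society.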

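Proposed Method

Automate the prompting of ChatGPT and GPT-4 on 25 public NLP datasets covering semantic and pragmatic tasks.
Evaluate model outputs automatically, focusing on task correctness rather than text quality or style.
Compare against available SOTA results, combining the metrics reported for each dataset with re-implemented baselines.
Include personalized prompting (Random Contextual Few-Shot Personalization) to test subjective tasks.
Assess model bias and potential training-data leakage by annotating the provenance of each dataset.
Analyze post-processing needs and model limitations qualitatively.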
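The post-processing step mentioned above can be illustrated with a minimal sketch, assuming a binary task whose replies arrive as free-form text. The label set `LABELS` and the function `parse_label` are hypothetical names chosen for illustration; the paper's actual parsing rules are not given in this review.

```python
import re
from typing import Optional, Sequence

# Hypothetical label set for a binary task such as aggression detection.
LABELS = ("aggressive", "non-aggressive")


def parse_label(response: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Map a free-form model reply to exactly one expected label.

    Returns None when no label, or more than one label, is found,
    which mirrors the idea of counting unusable answers separately.
    """
    text = response.strip().lower()
    found = []
    for label in labels:
        # The lookarounds stop "aggressive" from matching inside "non-aggressive".
        pattern = r"(?<![\w-])" + re.escape(label) + r"(?![\w-])"
        if re.search(pattern, text):
            found.append(label)
    return found[0] if len(found) == 1 else None


if __name__ == "__main__":
    for reply in ("The text is Non-Aggressive.", "Aggressive", "I cannot answer that."):
        print(repr(reply), "->", parse_label(reply))
```

Replies that map to no label would be counted as unusable, while replies that need this kind of cleanup before matching would be counted as post-processed.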

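Experimental Results

Research Questions

RQ1 (Q1): Does ChatGPT's performance loss relative to SOTA differ across task types, and how does GPT-4 perform?
RQ2 (Q3): Can Random Contextual Few-Shot Personalization improve subjective reasoning and overall reasoning quality?
RQ3 (Q4): How does context affect the answers when multiple related prompts are processed?
RQ4 (Q6): Is GPT-4 better or worse than ChatGPT on the evaluated tasks?
RQ5 (Q8): Which post-processing steps can improve ChatGPT's output quality on analytical tasks?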
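To make the personalization question more concrete, here is a minimal sketch of how a Random Contextual Few-Shot Personalization prompt might be assembled: a few randomly chosen earlier annotations by the same user are placed in front of the text to be labeled. The function name, the prompt wording, and the parameters `k` and `seed` are assumptions made for illustration, not the prompts used in the paper.

```python
import random
from typing import List, Tuple


def build_personalized_prompt(
    user_history: List[Tuple[str, str]],
    target_text: str,
    k: int = 3,
    seed: int = 0,
) -> str:
    """Build a few-shot prompt from k random past annotations of one user.

    user_history holds (text, label) pairs previously annotated by the user.
    """
    rng = random.Random(seed)  # seeded so the sampled context is reproducible
    shots = rng.sample(user_history, min(k, len(user_history)))
    parts = ["Examples of how this user labeled earlier texts:"]
    for text, label in shots:
        parts.append(f"Text: {text}\nLabel: {label}")
    parts.append(
        "Now label the following text as this user would.\n"
        f"Text: {target_text}\nLabel:"
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    history = [
        ("You are all idiots.", "aggressive"),
        ("Thanks for the fix!", "non-aggressive"),
        ("This edit is garbage.", "aggressive"),
        ("Please see the talk page.", "non-aggressive"),
    ]
    print(build_personalized_prompt(history, "Stop vandalizing this article."))
```

Sampling the context at random, rather than picking the most similar examples, keeps the sketch close to the "random" part of the method's name; a retrieval-based selection would be a different technique.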

Main Findings

ID	Task Name	Category	Language	NLP Problem	Context	Reasoning Type	Dataset / SOTA	Availability	Trained	#Test	#Used	#None	#Post-processed	#N
1	Aggression	P	EN	Offensiveness detection	No	Binary classification	WikiDetox Aggr. [62] / [63]	3	Yes	23153	1000	13	151 (15.1%)	987
2	AggressionPer	P	EN	Offensiveness det.: personalized	Yes	Binary classification	WikiDetox Aggr. [62] / [21]	2	No	349582	1000	19	92 (9.2%)	981
3	CoLa	S	EN	Linguistic acceptability	No	Binary classification	CoLA [64] / [65]	3	Yes	1042	1042	0	0 (0%)	1042
4	ColBERT	P	EN	Humor recognition	No	Binary classification	ColBERT [66] / [66]	2	No	40000	1000	5	93 (9.3%)	995
5	Sarcasm	P	EN	Humor recognition	No	Binary classification	Sarcasmania [67] / [68]	3	Yes	5967	1000	10	61 (6.1%)	990
6	Spam	P	EN	Spam detection	No	Binary classification	SMS Spam v.1 [69] / [70]	3	Yes	1115	1115	3	14 (1.3%)	1112
7	WordContext	S	EN	Word sense disambiguation	Yes	Binary pair classification	WiC [71] / [72]	3	No	638	638	0	5 (0.8%)	638
8	TextEntail	S	EN	Natural language inference	No	Binary sentence pair classification	RTE [73] / [72]	3	Yes	277	277	0	0 (0%)	277
9	WNLI	S	EN	Natural language inference	No	Binary sentence pair classification	WNLI [74] / [75]	3	Yes	71	71	0	0 (0%)	71
10	SQuAD	S	EN	Question answering	Yes	Extractive QA	SQuAD v2 [76] / [77]	3	Yes	11873	1000	0	247 (24.7%)	1000
11	MathQA	S	EN	Question answering	No	Mathematical reasoning	GSM8K [78] / [79]	3	Yes	1319	1000	0	1 (0.1%)	999
12	ClarinEmo	P	PL	Emotion recognition	No	Multi-label classification	ClarinEmo - / -	0	No	1264	1264	0	9 (0.7%)	1264
13	GoEmo	P	EN	Emotion recognition	No	Multi-label classification	GoEmotions [80] / [81]	3	No	5427	1000	18	87 (8.7%)	1000
14	GoEmoPer0	P	EN	Emotion rec.: personalized	No	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	28	1 (0.1%)	1123
15	GoEmoPer1	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	11	0 (0%)	1140
16	GoEmoPer2	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
17	GoEmoPer3	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
18	Unhealthy	P	EN	Offensiveness detection	No	Multi-label classification	Unhealthy Conv. [82] / [82]	3	No	44354	1000	22	348 (34.8%)	963
19	UnhealthyPer	P	EN	Offensiveness det.: personalized	Yes	Multi-label classification	Unhealthy Conv. [82] / [20]	2	No	227975	1000	9	15 (1.5%)	991
20	PolEmo	P	PL	Sentiment analysis	No	Multiclass classification	PolEmo2 [83] / [83]	1	No	820	820	3	23 (2.8%)	817
21	TweetEmoji	P	EN	Emoji prediction	No	Multiclass classification	TweetEval [84] / [85]	2	No	50000	1666	2	0 (0%)	1664
22	TweetSent	P	EN	Sentiment analysis	No	Multiclass classification	TweetEval [84] / [85]	2	No	12283	5143	0	245 (4.8%)	5143
23	TweetStance	S	EN	Stance detection	No	Multiclass classification	TweetEval [84] / [85]	2	No	1249	1249	7	99 (7.9%)	1249
24	ReAding	S	EN	Question answering	Yes	Multiple choice QA	RACE [86] / [87]	3	Yes	4887	1000	4	206 (20.6%)	996
25	WSD	S	EN	Word sense disambiguation	Yes	Sequence labeling	Raganato [88] / [89]	3	Yes	7253	7253	5	176 (2.4%)	7253
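The percentages next to the #Post-processed counts are consistent with the share of #Used prompts whose answers needed post-processing. A minimal sketch of that arithmetic, with `post_processed_share` as a hypothetical helper name and the inputs taken from the Aggression, Spam, and WordContext rows:

```python
def post_processed_share(n_post_processed: int, n_used: int) -> float:
    """Return the post-processed answers as a percentage of used prompts."""
    if n_used <= 0:
        raise ValueError("n_used must be positive")
    return round(100.0 * n_post_processed / n_used, 1)


if __name__ == "__main__":
    # (task, #Post-processed, #Used) as listed in the table
    rows = [("Aggression", 151, 1000), ("Spam", 14, 1115), ("WordContext", 5, 638)]
    for task, n_post, n_used in rows:
        print(f"{task}: {post_processed_share(n_post, n_used)}%")
```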

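Across zero-shot and few-shot evaluation, ChatGPT's average quality loss relative to SOTA is about 25%.
On semantic tasks, GPT-4 shows a significantly lower loss than ChatGPT.
Personalized, contextual prompting (Random Contextual Few-Shot Personalization) yields significantly better user-based predictions.
The harder the task (the lower the SOTA performance), the larger ChatGPT's loss, especially on pragmatic tasks such as emotion recognition.
There is evidence of bias in ChatGPT's outputs, likely stemming from the guidelines given to human trainers and from system policies.
Post-processing and task-specific prompts bring modest improvements on analytical tasks.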
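The "loss relative to SOTA" in these findings can be read as a relative drop in the task metric. The sketch below assumes the definition loss = 100% * (SOTA - model) / SOTA and averages it over tasks; the function names and the scores in the usage example are illustrative placeholders, not values from the paper.

```python
from typing import Iterable, Tuple


def relative_loss(sota: float, model: float) -> float:
    """Percentage by which the model falls short of the SOTA score.

    Assumed definition: 100 * (sota - model) / sota.
    """
    if sota <= 0:
        raise ValueError("sota score must be positive")
    return 100.0 * (sota - model) / sota


def average_loss(scores: Iterable[Tuple[float, float]]) -> float:
    """Average the relative loss over (sota, model) score pairs, one per task."""
    losses = [relative_loss(sota, model) for sota, model in scores]
    if not losses:
        raise ValueError("at least one task is required")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Illustrative placeholder scores, NOT results reported in the paper.
    example_scores = [(80.0, 60.0), (50.0, 40.0)]
    print(f"average loss: {average_loss(example_scores):.1f}%")
```

Because the loss is normalized by the SOTA score, the same absolute gap weighs more heavily on tasks where SOTA itself is low, which is one reason a relative measure suits comparisons across tasks of different difficulty.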

This review was generated by AI and reviewed by human editors.