QUICK REVIEW

[Paper Review] ChatGPT: Jack of all trades, master of none

Jan Kocoń, Igor Cichecki|arXiv (Cornell University)|2023. 02. 21.

Topic Modeling | Citations: 15

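One-Line Summary

This paper automates the evaluation of ChatGPT on 25 diverse NLP tasks (semantic and pragmatic) and of GPT-4 on five selected task subsets, compares their performance against SOTA, conducts a personalization study, and uncovers a bias in ChatGPT's responses.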

ABSTRACT

OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.

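Motivation and Goals

Evaluate the performance of ChatGPT and GPT-4 on a broad range of analytical NLP tasks, covering both subjective and objective problems.
Compare the results of ChatGPT and GPT-4 against SOTA benchmarks.
Investigate how personalization and context affect predictions.
Analyze potential biases and ethical considerations in model responses.
Discuss implications for evaluation methodology and for the societal usefulness of large language models (LLMs).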

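Proposed Method

Run automated prompting of ChatGPT and GPT-4 on 25 public NLP datasets covering semantic and pragmatic tasks.
Manually evaluate model outputs on whether they solve the task correctly, not on text quality or style.
Compare against available SOTA results, using the metrics reported for each dataset and re-implemented baselines.
Include personalized prompts (Random Contextual Few-Shot Personalization) to test subjective tasks.
Annotate the provenance of each dataset to assess model bias and possible training-data leakage.
Carry out a qualitative analysis of post-processing needs and model limitations.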
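
The personalized prompting step listed above can be illustrated with a minimal sketch. The review does not give the prompt wording or the number of examples, so the function name `build_personalized_prompt`, the default `k=3`, and the prompt text below are all hypothetical. The only idea taken from the source is that a few texts annotated by the same user are drawn at random and shown to the model as context.

```python
import random


def build_personalized_prompt(user_history, new_text, k=3, seed=None):
    """Build a few-shot prompt from one user's own annotations.

    user_history: list of (text, label) pairs annotated by a single user.
    k: number of randomly drawn examples (an assumed default, not from the paper).
    """
    rng = random.Random(seed)
    # Never ask for more examples than the user has annotated.
    k = min(k, len(user_history))
    examples = rng.sample(user_history, k)

    # Hypothetical prompt wording, for illustration only.
    parts = ["Below are texts with labels assigned by one particular user."]
    for text, label in examples:
        parts.append(f"Text: {text}\nLabel: {label}")
    parts.append("Predict the label this user would assign to the next text.")
    parts.append(f"Text: {new_text}\nLabel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    history = [
        ("What a lovely day", "joy"),
        ("This is outrageous", "anger"),
        ("I guess it is fine", "neutral"),
        ("Best news all week", "joy"),
    ]
    print(build_personalized_prompt(history, "I did not expect that", k=3, seed=0))
```

Setting `k=0` yields a prompt with no user examples, which gives a non-personalized baseline to compare against.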

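Experimental Results

Research Questions

RQ1 (Q1): Does ChatGPT's performance loss relative to SOTA differ across task types, and how does GPT-4 compare?
RQ2 (Q3): Can Random Contextual Few-Shot Personalization improve subjective reasoning and overall reasoning quality?
RQ3 (Q4): How does context affect the answers when several related prompts are processed?
RQ4 (Q6): Does GPT-4 perform better or worse than ChatGPT on the evaluated tasks?
RQ5 (Q8): Which post-processing steps can improve the quality of ChatGPT's output on analytical tasks?

Key Results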

Category: P = pragmatic, S = semantic.

ID	Task Name	Category	Language	NLP Problem	Context	Reasoning Type	Dataset / SOTA	Availability	Trained	#Test	#Used	#None	#Post-processed	#N
1	Aggression	P	EN	Offensiveness detection	No	Binary classification	WikiDetox Aggr. [62] / [63]	3	Yes	23153	1000	13	151 (15.1%)	987
2	AggressionPer	P	EN	Offensiveness det.: personalized	Yes	Binary classification	WikiDetox Aggr. [62] / [21]	2	No	349582	1000	19	92 (9.2%)	981
3	CoLa	S	EN	Linguistic acceptability	No	Binary classification	CoLA [64] / [65]	3	Yes	1042	1042	0	0 (0%)	1042
4	ColBERT	P	EN	Humor recognition	No	Binary classification	ColBERT [66] / [66]	2	No	40000	1000	5	93 (9.3%)	995
5	Sarcasm	P	EN	Humor recognition	No	Binary classification	Sarcasmania [67] / [68]	3	Yes	5967	1000	10	61 (6.1%)	990
6	Spam	P	EN	Spam detection	No	Binary classification	SMS Spam v.1 [69] / [70]	3	Yes	1115	1115	3	14 (1.3%)	1112
7	WordContext	S	EN	Word sense disambiguation	Yes	Binary pair classification	WiC [71] / [72]	3	No	638	638	0	5 (0.8%)	638
8	TextEntail	S	EN	Natural language inference	No	Binary sentence pair classification	RTE [73] / [72]	3	Yes	277	277	0	0 (0%)	277
9	WNLI	S	EN	Natural language inference	No	Binary sentence pair classification	WNLI [74] / [75]	3	Yes	71	71	0	0 (0%)	71
10	SQuAD	S	EN	Question answering	Yes	Extractive QA	SQuAD v2 [76] / [77]	3	Yes	11873	1000	0	247 (24.7%)	1000
11	MathQA	S	EN	Question answering	No	Mathematical reasoning	GSM8K [78] / [79]	3	Yes	1319	1000	0	1 (0.1%)	999
12	ClarinEmo	P	PL	Emotion recognition	No	Multi-label classification	ClarinEmo - / -	0	No	1264	1264	0	9 (0.7%)	1264
13	GoEmo	P	EN	Emotion recognition	No	Multi-label classification	GoEmotions [80] / [81]	3	No	5427	1000	18	87 (8.7%)	1000
14	GoEmoPer0	P	EN	Emotion rec.: personalized	No	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	28	1 (0.1%)	1123
15	GoEmoPer1	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	11	0 (0%)	1140
16	GoEmoPer2	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
17	GoEmoPer3	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
18	Unhealthy	P	EN	Offensiveness detection	No	Multi-label classification	Unhealthy Conv. [82] / [82]	3	No	44354	1000	22	348 (34.8%)	963
19	UnhealthyPer	P	EN	Offensiveness det.: personalized	Yes	Multi-label classification	Unhealthy Conv. [82] / [20]	2	No	227975	1000	9	15 (1.5%)	991
20	PolEmo	P	PL	Sentiment analysis	No	Multiclass classification	PolEmo2 [83] / [83]	1	No	820	820	3	23 (2.8%)	817
21	TweetEmoji	P	EN	Emoji prediction	No	Multiclass classification	TweetEval [84] / [85]	2	No	50000	1666	2	0 (0%)	1664
22	TweetSent	P	EN	Sentiment analysis	No	Multiclass classification	TweetEval [84] / [85]	2	No	12283	5143	0	245 (4.8%)	5143
23	TweetStance	S	EN	Stance detection	No	Multiclass classification	TweetEval [84] / [85]	2	No	1249	1249	7	99 (7.9%)	1249
24	ReAding	S	EN	Question answering	Yes	Multiple choice QA	RACE [86] / [87]	3	Yes	4887	1000	4	206 (20.6%)	996
25	WSD	S	EN	Word sense disambiguation	Yes	Sequence labeling	Raganato [88] / [89]	3	Yes	7253	7253	5	176 (2.4%)	7253
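
The percentages in the table's post-processing column appear to be the post-processed count divided by the number of prompts used (#Used). The short check below reproduces three of them from the table's own counts. The helper name `post_processed_share` is hypothetical, and the relationship is inferred from the numbers rather than stated in the review.

```python
def post_processed_share(used, post_processed):
    """Share of used prompts whose answers needed post-processing, in percent."""
    if used <= 0:
        raise ValueError("used must be positive")
    return round(100.0 * post_processed / used, 1)


# (#Used, #Post-processed) pairs copied from the table above.
rows = {
    "Aggression": (1000, 151),
    "Spam": (1115, 14),
    "SQuAD": (1000, 247),
}

if __name__ == "__main__":
    for task, (used, post) in rows.items():
        print(f"{task}: {post_processed_share(used, post)}%")
```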

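In zero-shot and few-shot settings, ChatGPT shows an average quality loss of about 25% relative to SOTA across the evaluated tasks.
For GPT-4, the loss on semantic tasks is significantly lower than for ChatGPT.
Personalized, context-aware prompts (Random Contextual Few-Shot Personalization) yield significantly better user-based predictions.
The harder the task (the lower the SOTA performance), the larger the loss, and this is most pronounced for pragmatic tasks such as emotion recognition.
ChatGPT's outputs show evidence of bias, most likely stemming from the rules OpenAI imposed on human trainers.
Post-processing and task-specific prompts can slightly improve results on analytical tasks.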
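
The "loss relative to SOTA" figures can be read as a relative difference between the two scores. The review does not state the formula, so the definition below (the gap divided by the SOTA score, in percent) is an assumption, and the example scores are made up purely to show the arithmetic.

```python
def relative_loss(sota_score, model_score):
    """Relative quality loss of a model versus SOTA, in percent.

    Assumed definition: 100 * (SOTA - model) / SOTA.
    """
    if sota_score <= 0:
        raise ValueError("sota_score must be positive")
    return 100.0 * (sota_score - model_score) / sota_score


def average_loss(score_pairs):
    """Mean relative loss over a list of (sota_score, model_score) pairs."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    losses = [relative_loss(sota, model) for sota, model in score_pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical scores, not taken from the paper.
    print(relative_loss(80.0, 60.0))
    print(average_loss([(80.0, 60.0), (50.0, 45.0)]))
```

Under this definition, the same absolute gap weighs more when the SOTA score is low. This is one reason a relative measure suits a comparison across tasks of differing difficulty.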

This review was generated by AI and reviewed by a human editor.