QUICK REVIEW

[Paper Review] ChatGPT: Jack of all trades, master of none

Jan Kocoń, Igor Cichecki|arXiv (Cornell University)|Feb 21, 2023

Topic Modeling|Citations: 15

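ONE-SENTENCE SUMMARY

This paper automates the evaluation of ChatGPT and GPT-4 across 25 diverse NLP tasks (semantic and pragmatic), compares their performance with SOTA, studies personalization, and reports on bias.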

ABSTRACT

OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.

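MOTIVATION AND OBJECTIVES

Evaluate the performance of ChatGPT and GPT-4 on a broad range of analytical NLP tasks, including both subjective and objective problems.
Compare the ChatGPT and GPT-4 results with SOTA benchmarks.
Investigate how personalization and context affect predictions.
Analyze potential biases and ethical considerations in the models' responses.
Offer implications for evaluation methodology and for the usefulness of LLMs to society.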

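PROPOSED METHOD

Prompt ChatGPT and GPT-4 automatically across 25 public NLP datasets covering semantic and pragmatic tasks.
Judge task correctness (rather than text quality or style) through manual evaluation of the model outputs.
Compare against available SOTA results, using dataset-reported values and reimplemented baselines.
Include personalized prompts (Random Contextual Few-Shot Personalization) to test subjective tasks.
Assess possible training-data leakage and model bias by annotating the provenance of each dataset.
Qualitatively analyze post-processing needs and model limitations.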
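
The personalized prompting step listed above can be illustrated with a minimal sketch. It assumes that Random Contextual Few-Shot Personalization amounts to prepending a few randomly sampled past annotations from the same user to the prompt; the function name `build_personalized_prompt`, the `(text, label)` history format, and the prompt wording are all hypothetical and are not taken from the paper.

```python
import random


def build_personalized_prompt(user_history, text, labels, k=3, seed=None):
    """Build a few-shot prompt from k randomly sampled annotations of one user.

    user_history: list of (text, label) pairs previously annotated by the user
                  (hypothetical format, assumed for illustration only).
    """
    rng = random.Random(seed)
    k = min(k, len(user_history))
    # Random context: sample k past annotations without replacement.
    examples = rng.sample(user_history, k)

    lines = [f"Possible labels: {', '.join(labels)}."]
    if examples:
        lines.append("Here are texts this user annotated earlier:")
        for ex_text, ex_label in examples:
            lines.append(f'Text: "{ex_text}" -> Label: {ex_label}')
    lines.append(f'How would this user label the following text? Text: "{text}"')
    lines.append("Answer with one label only.")
    return "\n".join(lines)


history = [("a", "joy"), ("b", "anger"), ("c", "joy"), ("d", "neutral")]
prompt = build_personalized_prompt(
    history, "new text", ["joy", "anger", "neutral"], k=2, seed=0
)
print(prompt)
```

Setting `k=0` yields a non-personalized prompt with the same wording, which makes it easy to compare personalized and baseline predictions on the same texts.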

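EXPERIMENTAL RESULTS

Research Questions

RQ1 (Q1): Does ChatGPT's performance loss relative to SOTA differ by task type, and how does GPT-4 compare?
RQ2 (Q3): Can Random Contextual Few-Shot Personalization improve subjective reasoning and overall reasoning quality?
RQ3 (Q4): How does context affect answers when multiple related prompts are processed?
RQ4 (Q6): Is GPT-4 better or worse than ChatGPT across the evaluated tasks?
RQ5 (Q8): Which post-processing steps improve output quality on analytical tasks?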

Key Findings

ID	Task name	Category	Language	NLP problem	Context	Inference type	Dataset / SOTA	Availability	Trained	#Test	#Used	#None	#Post-processed	#N	#Classes
1	Aggression	P	EN	Offensiveness detection	No	Binary classification	WikiDetox Aggr. [62] / [63]	3	Yes	23153	1000	13	151 (15.1%)	987
2	AggressionPer	P	EN	Offensiveness det.: personalized	Yes	Binary classification	WikiDetox Aggr. [62] / [21]	2	No	349582	1000	19	92 (9.2%)	981
3	CoLa	S	EN	Linguistic acceptability	No	Binary classification	CoLA [64] / [65]	3	Yes	1042	1042	0	0 (0%)	1042
4	ColBERT	P	EN	Humor recognition	No	Binary classification	ColBERT [66] / [66]	2	No	40000	1000	5	93 (9.3%)	995
5	Sarcasm	P	EN	Humor recognition	No	Binary classification	Sarcasmania [67] / [68]	3	Yes	5967	1000	10	61 (6.1%)	990
6	Spam	P	EN	Spam detection	No	Binary classification	SMS Spam v.1 [69] / [70]	3	Yes	1115	1115	3	14 (1.3%)	1112
7	WordContext	S	EN	Word sense disambiguation	Yes	Binary pair classification	WiC [71] / [72]	3	No	638	638	0	5 (0.8%)	638
8	TextEntail	S	EN	Natural language inference	No	Binary sentence pair classification	RTE [73] / [72]	3	Yes	277	277	0	0 (0%)	277
9	WNLI	S	EN	Natural language inference	No	Binary sentence pair classification	WNLI [74] / [75]	3	Yes	71	71	0	0 (0%)	71
10	SQuAD	S	EN	Question answering	Yes	Extractive QA	SQuAD v2 [76] / [77]	3	Yes	11873	1000	0	247 (24.7%)	1000
11	MathQA	S	EN	Question answering	No	Mathematical reasoning	GSM8K [78] / [79]	3	Yes	1319	1000	0	1 (0.1%)	999
12	ClarinEmo	P	PL	Emotion recognition	No	Multi-label classification	ClarinEmo - / -	0	No	1264	1264	0	9 (0.7%)	1264
13	GoEmo	P	EN	Emotion recognition	No	Multi-label classification	GoEmotions [80] / [81]	3	No	5427	1000	18	87 (8.7%)	1000
14	GoEmoPer0	P	EN	Emotion rec.: personalized	No	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	28	1 (0.1%)	1123
15	GoEmoPer1	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	11	0 (0%)	1140
16	GoEmoPer2	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
17	GoEmoPer3	P	EN	Emotion rec.: personalized	Yes	Multi-label classification	GoEmotions [80] / [81]	2	No	19470	1151	10	0 (0%)	1141
18	Unhealthy	P	EN	Offensiveness detection	No	Multi-label classification	Unhealthy Conv. [82] / [82]	3	No	44354	1000	22	348 (34.8%)	963
19	UnhealthyPer	P	EN	Offensiveness det.: personalized	Yes	Multi-label classification	Unhealthy Conv. [82] / [20]	2	No	227975	1000	9	15 (1.5%)	991
20	PolEmo	P	PL	Sentiment analysis	No	Multiclass classification	PolEmo2 [83] / [83]	1	No	820	820	3	23 (2.8%)	817
21	TweetEmoji	P	EN	Emoji prediction	No	Multiclass classification	TweetEval [84] / [85]	2	No	50000	1666	2	0 (0%)	1664
22	TweetSent	P	EN	Sentiment analysis	No	Multiclass classification	TweetEval [84] / [85]	2	No	12283	5143	0	245 (4.8%)	5143
23	TweetStance	S	EN	Stance detection	No	Multiclass classification	TweetEval [84] / [85]	2	No	1249	1249	7	99 (7.9%)	1249
24	ReAding	S	EN	Question answering	Yes	Multiple choice QA	RACE [86] / [87]	3	Yes	4887	1000	4	206 (20.6%)	996
25	WSD	S	EN	Word sense disambiguation	Yes	Sequence labeling	Raganato [88] / [89]	3	Yes	7253	7253	5	176 (2.4%)	7253
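
The table counts how many responses needed post-processing before they could be scored. As a rough illustration of what such a step might look like, the sketch below maps free-form replies onto a fixed label set and reports the share that needed cleanup. The matching rule (exactly one label found as a whole word), the function names, and the example replies are assumptions made for illustration, not the authors' procedure.

```python
import re


def normalize_answer(raw, labels):
    """Map a free-form reply to a known label.

    Returns (label, was_postprocessed). The label is None when no single
    label can be identified; the flag is True only when cleanup was needed.
    """
    if raw in labels:
        return raw, False  # already a clean label, nothing to do
    text = raw.lower()
    found = [
        label
        for label in labels
        if re.search(r"\b" + re.escape(label.lower()) + r"\b", text)
    ]
    if len(found) == 1:
        return found[0], True
    return None, False  # ambiguous or missing answer


def postprocessed_share(replies, labels):
    """Percentage of replies that had to be cleaned up to obtain a label."""
    flags = [normalize_answer(reply, labels)[1] for reply in replies]
    return 100.0 * sum(flags) / len(replies)


labels = ["yes", "no"]
replies = ["yes", "Yes.", "No, it is not.", "unclear"]
share = postprocessed_share(replies, labels)
print(share)
```

The percentages in the table are consistent with dividing the post-processed count by the number of used examples; for the Aggression task, 151 of 1000 gives 15.1%.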

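Across many tasks, ChatGPT shows an average loss of about 25% relative to SOTA in zero-shot and few-shot settings.
GPT-4 shows a substantially smaller loss than ChatGPT on semantic tasks.
Personalized, contextual prompts (Random Contextual Few-Shot Personalization) significantly improve user-specific predictions.
The harder the task (the lower the SOTA baseline), the larger the loss, most notably on pragmatic tasks such as emotion recognition.
ChatGPT's outputs suggest a bias attributable to trainer guidelines and system policies.
Post-processing and task-specific prompting can modestly improve results on analytical tasks.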
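
To make the "about 25% loss" figure concrete, here is a minimal sketch of a relative loss measure. It assumes loss is defined as the percentage gap between the SOTA score and the model score, divided by the SOTA score; that definition and the scores used below are illustrative assumptions, not values from the paper.

```python
def relative_loss(sota, model):
    """Percentage loss of the model score relative to the SOTA score.

    Assumed definition: 100 * (sota - model) / sota.
    """
    if sota <= 0:
        raise ValueError("SOTA score must be positive")
    return 100.0 * (sota - model) / sota


def average_loss(scores):
    """Mean relative loss over a dict of task -> (sota_score, model_score)."""
    losses = [relative_loss(sota, model) for sota, model in scores.values()]
    return sum(losses) / len(losses)


# Hypothetical scores, chosen only to show the arithmetic.
scores = {
    "task_a": (80.0, 60.0),  # loss of 25%
    "task_b": (50.0, 25.0),  # loss of 50%
}
avg = average_loss(scores)
print(avg)
```

Because the loss is normalized by the SOTA score, the same absolute gap weighs more on tasks where SOTA itself is low, which matches the finding that harder tasks show larger losses.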


This review was generated by AI and checked by a human editor.