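[Paper Review] ChatGPT: Jack of all trades, master of none
This paper automatically evaluates ChatGPT and GPT-4 on 25 diverse NLP tasks (semantic and pragmatic), compares them with SOTA, studies personalization, and reveals bias.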
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated GPT-4 model on five selected subsets of NLP tasks. We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For GPT-4 model, a loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
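Research Motivation and Goals
- Evaluate ChatGPT and GPT-4 on a broad range of analytical NLP tasks, covering both subjective and objective problems.
- Compare the results of ChatGPT and GPT-4 with state-of-the-art (SOTA) baselines.
- Study how personalization and context affect predictions.
- Analyze potential bias and ethical considerations in model outputs.
- Offer insights into evaluation methodology and the usefulness of large language models to society.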
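Proposed Method
- Automated prompting of ChatGPT and GPT-4 on 25 public NLP datasets covering semantic and pragmatic tasks.
- Manual evaluation of model outputs, focusing on task correctness rather than text quality or style.
- Comparison with available SOTA results, combining the metrics reported for each dataset with re-implemented baselines.
- Personalized prompts (Random Contextual Few-Shot Personalization) to test subjective tasks.
- Assessment of model bias and potential training-data leakage by annotating dataset provenance.
- Qualitative analysis of post-processing needs and model limitations.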
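The automated prompting and post-processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the label set, the prompt wording, and the helper names `build_prompt` and `parse_label` are all assumptions, and no model API is called. The point is that a free-form reply must be mapped back onto a fixed label before it can be scored, and that replies which cannot be mapped are counted separately.

```python
import re
from typing import Optional

# Hypothetical label set for a binary offensiveness task (an assumption).
LABELS = ("aggressive", "non-aggressive")


def build_prompt(text: str) -> str:
    """Compose a zero-shot classification prompt (wording is illustrative)."""
    options = " or ".join(f'"{label}"' for label in LABELS)
    return (
        f"Which one of the attributes: {options} describes the following text? "
        f"Answer with the attribute only.\nText: {text}"
    )


def parse_label(response: str) -> Optional[str]:
    """Map a free-form model reply onto one known label, or None if unclear."""
    lowered = response.lower()
    # The lookarounds stop "aggressive" from matching inside "non-aggressive".
    found = [
        label
        for label in LABELS
        if re.search(rf"(?<![\w-]){re.escape(label)}(?![\w-])", lowered)
    ]
    # Zero or several labels in one reply means the answer is unusable.
    return found[0] if len(found) == 1 else None
```

A reply such as "The text is non-aggressive." resolves to a single label, while a refusal or a reply naming both labels yields `None` and would need manual handling.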
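Experimental Results
Research Questions
- RQ1 (Q1): Does ChatGPT's performance loss relative to SOTA differ across task types, and how does GPT-4 perform?
- RQ2 (Q3): Does Random Contextual Few-Shot Personalization improve subjective reasoning and overall reasoning quality?
- RQ3 (Q4): How does context affect the answers when multiple related prompts are processed?
- RQ4 (Q6): Is GPT-4 better or worse than ChatGPT on the evaluated tasks?
- RQ5 (Q8): Which post-processing steps can improve the quality of ChatGPT's output on analytical tasks?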
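To make the personalization question more concrete, here is a minimal sketch of what a random contextual few-shot prompt could look like. It assumes the technique amounts to sampling a few of one user's earlier annotations at random and placing them before the new text; this summary does not spell out the exact procedure, so the function name `personalized_prompt`, the prompt wording, and the parameters `k` and `seed` are hypothetical.

```python
import random
from typing import List, Tuple

Annotation = Tuple[str, str]  # (text, label previously given by this user)


def personalized_prompt(history: List[Annotation], new_text: str,
                        k: int = 3, seed: int = 0) -> str:
    """Prepend up to k randomly chosen past annotations of one user to the query."""
    rng = random.Random(seed)  # seeded so the sampled context is reproducible
    shots = rng.sample(history, min(k, len(history)))
    lines = ["Here are texts previously labeled by the same user:"]
    for text, label in shots:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(
        f"Label the next text as this user would.\nText: {new_text}\nLabel:"
    )
    return "\n\n".join(lines)
```

Sampling at random, rather than picking the most similar past texts, keeps the sketch independent of any similarity model; whether that matches the paper's setup in every detail is an assumption.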
Main Findings
| ID | Task Name | Category | Language | NLP Problem | Context | Reasoning Type | Dataset / SOTA | Availability | Trained | #Test | #Used | #None | #Post-processed | #N | #Classes | #Majority/minority class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Aggression | P | EN | Offensiveness detection | No | Binary classification | WikiDetox Aggr. [62] / [63] | 3 | Yes | 23153 | 1000 | 13 | 151 (15.1%) | 987 | | |
| 2 | AggressionPer | P | EN | Offensiveness det.: personalized | Yes | Binary classification | WikiDetox Aggr. [62] / [21] | 2 | No | 349582 | 1000 | 19 | 92 (9.2%) | 981 | | |
| 3 | CoLa | S | EN | Linguistic acceptability | No | Binary classification | CoLA [64] / [65] | 3 | Yes | 1042 | 1042 | 0 | 0 (0%) | 1042 | | |
| 4 | ColBERT | P | EN | Humor recognition | No | Binary classification | ColBERT [66] / [66] | 2 | No | 40000 | 1000 | 5 | 93 (9.3%) | 995 | | |
| 5 | Sarcasm | P | EN | Humor recognition | No | Binary classification | Sarcasmania [67] / [68] | 3 | Yes | 5967 | 1000 | 10 | 61 (6.1%) | 990 | | |
| 6 | Spam | P | EN | Spam detection | No | Binary classification | SMS Spam v.1 [69] / [70] | 3 | Yes | 1115 | 1115 | 3 | 14 (1.3%) | 1112 | | |
| 7 | WordContext | S | EN | Word sense disambiguation | Yes | Binary pair classification | WiC [71] / [72] | 3 | No | 638 | 638 | 0 | 5 (0.8%) | 638 | | |
| 8 | TextEntail | S | EN | Natural language inference | No | Binary sentence pair classification | RTE [73] / [72] | 3 | Yes | 277 | 277 | 0 | 0 (0%) | 277 | | |
| 9 | WNLI | S | EN | Natural language inference | No | Binary sentence pair classification | WNLI [74] / [75] | 3 | Yes | 71 | 71 | 0 | 0 (0%) | 71 | | |
| 10 | SQuAD | S | EN | Question answering | Yes | Extractive QA | SQuAD v2 [76] / [77] | 3 | Yes | 11873 | 1000 | 0 | 247 (24.7%) | 1000 | | |
| 11 | MathQA | S | EN | Question answering | No | Mathematical reasoning | GSM8K [78] / [79] | 3 | Yes | 1319 | 1000 | 0 | 1 (0.1%) | 999 | | |
| 12 | ClarinEmo | P | PL | Emotion recognition | No | Multi-label classification | ClarinEmo - / - | 0 | No | 1264 | 1264 | 0 | 9 (0.7%) | 1264 | | |
| 13 | GoEmo | P | EN | Emotion recognition | No | Multi-label classification | GoEmotions [80] / [81] | 3 | No | 5427 | 1000 | 18 | 87 (8.7%) | 1000 | | |
| 14 | GoEmoPer0 | P | EN | Emotion rec.: personalized | No | Multi-label classification | GoEmotions [80] / [81] | 2 | No | 19470 | 1151 | 28 | 1 (0.1%) | 1123 | | |
| 15 | GoEmoPer1 | P | EN | Emotion rec.: personalized | Yes | Multi-label classification | GoEmotions [80] / [81] | 2 | No | 19470 | 1151 | 11 | 0 (0%) | 1140 | | |
| 16 | GoEmoPer2 | P | EN | Emotion rec.: personalized | Yes | Multi-label classification | GoEmotions [80] / [81] | 2 | No | 19470 | 1151 | 10 | 0 (0%) | 1141 | | |
| 17 | GoEmoPer3 | P | EN | Emotion rec.: personalized | Yes | Multi-label classification | GoEmotions [80] / [81] | 2 | No | 19470 | 1151 | 10 | 0 (0%) | 1141 | | |
| 18 | Unhealthy | P | EN | Offensiveness detection | No | Multi-label classification | Unhealthy Conv. [82] / [82] | 3 | No | 44354 | 1000 | 22 | 348 (34.8%) | 963 | | |
| 19 | UnhealthyPer | P | EN | Offensiveness det.: personalized | Yes | Multi-label classification | Unhealthy Conv. [82] / [20] | 2 | No | 227975 | 1000 | 9 | 15 (1.5%) | 991 | | |
| 20 | PolEmo | P | PL | Sentiment analysis | No | Multiclass classification | PolEmo2 [83] / [83] | 1 | No | 820 | 820 | 3 | 23 (2.8%) | 817 | | |
| 21 | TweetEmoji | P | EN | Emoji prediction | No | Multiclass classification | TweetEval [84] / [85] | 2 | No | 50000 | 1666 | 2 | 0 (0%) | 1664 | | |
| 22 | TweetSent | P | EN | Sentiment analysis | No | Multiclass classification | TweetEval [84] / [85] | 2 | No | 12283 | 5143 | 0 | 245 (4.8%) | 5143 | | |
| 23 | TweetStance | S | EN | Stance detection | No | Multiclass classification | TweetEval [84] / [85] | 2 | No | 1249 | 1249 | 7 | 99 (7.9%) | 1249 | | |
| 24 | ReAding | S | EN | Question answering | Yes | Multiple choice QA | RACE [86] / [87] | 3 | Yes | 4887 | 1000 | 4 | 206 (20.6%) | 996 | | |
| 25 | WSD | S | EN | Word sense disambiguation | Yes | Sequence labeling | Raganato [88] / [89] | 3 | Yes | 7253 | 7253 | 5 | 176 (2.4%) | 7253 | | |
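- Across most tasks, in both zero-shot and few-shot settings, ChatGPT's average performance loss relative to SOTA is about 25%.
- GPT-4 shows a significantly smaller loss than ChatGPT on semantic tasks.
- Personalized, contextualized prompting (Random Contextual Few-Shot Personalization) yields significantly better user-based predictions.
- The loss is larger on harder tasks (those with a lower SOTA baseline), especially pragmatic tasks such as emotion recognition.
- There is evidence of bias in ChatGPT's output, possibly stemming from human-trainer guidelines and system policies.
- Post-processing and task-specific prompts bring modest improvements on analytical tasks.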
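The "about 25%" figure above is easier to interpret with the arithmetic written out. The sketch below assumes the loss is the relative drop of the model's score from the SOTA score, expressed in percent and averaged over tasks; this summary does not give the formula, so treat that definition, the function names, and the numbers as illustrative. The scores are made up and are not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple


def relative_loss(sota: float, model: float) -> float:
    """Percentage drop of the model score relative to the SOTA score."""
    return 100.0 * (sota - model) / sota


def average_loss(scores: Dict[str, Tuple[float, float]]) -> float:
    """Mean relative loss over tasks; scores maps task -> (sota, model)."""
    return mean(relative_loss(s, m) for s, m in scores.values())


# Made-up scores for two hypothetical tasks, not results from the paper.
example = {"task_a": (80.0, 60.0), "task_b": (50.0, 40.0)}
print(round(average_loss(example), 1))  # 22.5
```

Because the loss is relative to the SOTA score, the same absolute gap weighs more heavily on tasks where SOTA itself is low.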
This review was generated by AI and reviewed by a human editor.