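[Paper Explained] Language Models are Few-Shot Learners
GPT-3, an autoregressive model with 175 billion parameters, demonstrates strong in-context (few-shot) learning across diverse NLP tasks without any gradient updates, and its performance improves with model scale and with the number of in-context examples.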
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
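Research Motivation and Goals
- Motivate the removal of task-specific fine-tuning by exploring few-shot, one-shot, and zero-shot settings.
- Measure how increasing model size and the amount of context affects in-context learning across diverse NLP tasks.
- Assess the limitations, data contamination risks, and societal impacts of large language models.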
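Proposed Method
- Train eight GPT-3 model sizes, ranging from 125M to 175B parameters, using a Transformer with alternating dense and sparse attention patterns.
- Pre-train on a mixture of curated and filtered datasets (Common Crawl, WebText, Books, Wikipedia), totaling 300B tokens.
- Evaluate in zero-shot, one-shot, and few-shot settings, conditioning the model on natural-language prompts and demonstrations placed within a 2048-token context window.
- Score with task-appropriate metrics (F1, BLEU, exact match), using beam search for free-form completions.
- Investigate data contamination, report potential overlap with test sets, and flag where that overlap may inflate results.
- Compare performance against state-of-the-art fine-tuned models where applicable.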
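The zero-shot, one-shot, and few-shot settings differ only in how many demonstrations are placed in the prompt before the query. The following is a minimal sketch of that idea, not the paper's actual evaluation code: the `Q:`/`A:` template, the function names, and the whitespace-based `count_tokens` are all illustrative assumptions (a real setup would count tokens with the model's own tokenizer).

```python
from typing import List, Tuple

CONTEXT_TOKENS = 2048  # context window size mentioned in the summary above


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): count whitespace-separated words."""
    return len(text.split())


def build_prompt(
    instruction: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    k: int,
    max_tokens: int = CONTEXT_TOKENS,
) -> str:
    """Build a zero-shot (k=0), one-shot (k=1), or few-shot (k>1) prompt.

    Up to k demonstrations are added in order; adding stops early if the
    next demonstration would push the prompt past the token budget.
    """
    query_block = f"Q: {query}\nA:"
    parts = [instruction] if instruction else []
    used = count_tokens(instruction) + count_tokens(query_block)

    for question, answer in demonstrations[:k]:
        demo = f"Q: {question}\nA: {answer}"
        cost = count_tokens(demo)
        if used + cost > max_tokens:
            break  # no room left in the context window
        parts.append(demo)
        used += cost

    # The query always comes last, with the answer left for the model to complete.
    parts.append(query_block)
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("cheese", "fromage"), ("dog", "chien")]
    for k in (0, 1, 2):
        print(f"--- k={k} ---")
        print(build_prompt("Translate English to French.", demos, "cat", k))
```

No model weights change at any point: the demonstrations act purely as conditioning text, which is why the number of usable examples is bounded by the context window rather than by a training budget.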
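Experimental Results
Research Questions
- RQ1: How does GPT-3 perform across a broad range of NLP tasks under zero-shot, one-shot, and few-shot conditions?
- RQ2: Does increasing model size improve in-context learning efficiency and few-shot performance across tasks?
- RQ3: What are the limitations and failure modes of large-scale in-context learning?
- RQ4: To what extent does data contamination affect the reported results on benchmark tasks?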
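Key Findings
- GPT-3 shows strong few-shot performance on many NLP datasets, sometimes matching or even surpassing fine-tuned state-of-the-art models.
- Zero-shot performance improves steadily with model size, while few-shot performance improves more rapidly as both model size and the number of examples grow.
- In the few-shot setting, GPT-3 can handle tasks that require on-the-fly reasoning, such as unscrambling words and 3-digit arithmetic, and can generate synthetic news articles that resemble human writing.
- GPT-3 still struggles in the few-shot setting on some NLI and reading comprehension benchmarks.
- Data contamination has little effect on most datasets but may inflate results on a few benchmarks, so the authors report the affected results only partially.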
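One common way to probe contamination of this kind is to flag test examples that share a long n-gram with the training text, then compare scores on the "clean" subset against the full benchmark. The sketch below illustrates that general technique under simplifying assumptions: lowercased whitespace tokenization, a configurable `n`, and hypothetical function names. It is not the authors' exact procedure.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split).

    Texts shorter than n words yield an empty set, so they are never flagged
    in this simplified sketch.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(train_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that appears anywhere in the training documents."""
    index: Set[NGram] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def split_clean_dirty(
    test_examples: List[str], index: Set[NGram], n: int
) -> Tuple[List[str], List[str]]:
    """Split test examples into (clean, dirty) by n-gram overlap with the index."""
    clean: List[str] = []
    dirty: List[str] = []
    for example in test_examples:
        if ngrams(example, n) & index:
            dirty.append(example)  # shares at least one n-gram with training text
        else:
            clean.append(example)
    return clean, dirty


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    tests = [
        "a quick brown fox jumps high",
        "completely unrelated sentence about language models",
    ]
    index = build_index(train, n=4)
    clean, dirty = split_clean_dirty(tests, index, n=4)
    print("clean:", clean)
    print("dirty:", dirty)
```

If a model scores noticeably higher on the full benchmark than on the clean subset, that gap is a signal the overlap may be inflating the result. A small or zero gap suggests the overlap is mostly harmless, which matches the finding that most datasets are little affected.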
This explainer was generated by AI and reviewed by a human editor.