QUICK REVIEW

[Paper Review] Language Models are Few-Shot Learners

T. B. Brown, Benjamin Mann | arXiv (Cornell University) | May 28, 2020

Topic Modeling | References: 127 | Cited by: 3,027

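One-Sentence Summary

GPT-3, an autoregressive language model with 175 billion parameters, shows strong in-context (few-shot) learning across diverse NLP tasks without any gradient updates, and its performance improves as model size and the number of in-context examples increase.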

ABSTRACT

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

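Research Motivation and Goals

Motivate removing task-specific fine-tuning by exploring the few-shot, one-shot, and zero-shot settings.
Assess how increasing model size and context affects in-context learning across diverse NLP tasks.
Assess the limitations, data contamination risks, and societal impact of large language models.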

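Proposed Method

Train eight GPT-3 model sizes, from 125M to 175B parameters, using a Transformer with alternating dense and sparse attention patterns.
Pre-train on a mix of curated and filtered datasets (Common Crawl, WebText, Books, Wikipedia), for a total of 300B training tokens.
Evaluate in zero-shot, one-shot, and few-shot settings, conditioning the model on natural-language prompts and demonstrations within a 2048-token context window.
Score with task-appropriate metrics (F1, BLEU, exact match), and use beam search for free-form completions.
Investigate data contamination and report potential overlap with test sets, noting where overlap may inflate results.
Where applicable, compare performance against state-of-the-art fine-tuned models.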
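
The evaluation setup above can be illustrated with a minimal sketch: a K-shot prompt is a task description, K demonstrations, and the query, all packed into the context window, and the completion is then scored with a metric such as exact match or token-level F1. Everything named here is hypothetical and not taken from the paper: the `Q:`/`A:` template, the helper names, and especially `count_tokens`, which uses whitespace splitting as a stand-in for the model's real tokenizer.

```python
from collections import Counter


def count_tokens(text):
    # Assumption: whitespace splitting stands in for the model's real tokenizer.
    return len(text.split())


def build_prompt(description, demos, query, k, max_tokens=2048):
    """Build a K-shot prompt: description, up to k demonstrations, then the query.

    k=0 is zero-shot, k=1 is one-shot, larger k is few-shot. Demonstrations
    are dropped from the end while the prompt exceeds the token budget.
    """
    chosen = list(demos[:k])
    while True:
        parts = [description] if description else []
        parts += [f"Q: {q}\nA: {a}" for q, a in chosen]
        parts.append(f"Q: {query}\nA:")  # the model is asked to continue from here
        prompt = "\n\n".join(parts)
        if count_tokens(prompt) <= max_tokens or not chosen:
            return prompt
        chosen.pop()


def exact_match(prediction, reference):
    # Case-insensitive comparison after trimming surrounding whitespace.
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

No gradient update appears anywhere in this flow: the only thing that changes between the zero-shot, one-shot, and few-shot settings is the value of `k`, that is, how many demonstrations are placed in the prompt text.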

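Experimental Results

Research Questions

RQ1: How does GPT-3 perform across a broad range of NLP tasks in zero-shot, one-shot, and few-shot conditions?
RQ2: Does increasing model size improve in-context learning efficiency and few-shot performance across tasks?
RQ3: What are the limitations and failure modes of in-context learning at scale?
RQ4: To what extent does data contamination affect reported benchmark results?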

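Key Findings

GPT-3 shows strong few-shot performance on many NLP datasets, sometimes matching or even surpassing fine-tuned state-of-the-art models.
Zero-shot performance improves steadily with model size, while few-shot performance improves more rapidly as both model size and the number of examples grow.
In the few-shot setting, GPT-3 handles tasks that require on-the-fly reasoning, such as unscrambling words and 3-digit arithmetic, and generates synthetic news articles that human evaluators have difficulty distinguishing from human-written ones.
GPT-3 still struggles in the few-shot setting on some NLI and reading comprehension benchmarks.
Data contamination has little effect on most datasets but may inflate results on a few benchmarks, so the authors omit or flag the affected results.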
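
The contamination finding rests on checking benchmark examples for overlap with the training corpus. The following is a simplified sketch of n-gram overlap detection, not the authors' exact procedure: the function names, the lowercase alphanumeric tokenization, and the default `n=13` are assumptions chosen for illustration.

```python
import re


def ngrams(text, n):
    # Assumption: lowercase alphanumeric words approximate the real normalization.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n words yield an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def split_by_overlap(test_examples, training_docs, n=13):
    """Split test examples into (clean, dirty) by n-gram overlap with training text."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    clean, dirty = [], []
    for example in test_examples:
        # An example is "dirty" if it shares at least one n-gram with training data.
        if ngrams(example, n) & train_grams:
            dirty.append(example)
        else:
            clean.append(example)
    return clean, dirty
```

Scoring the model separately on the `clean` subset and on the full benchmark gives a rough sense of whether overlap inflates a result: a large drop on the clean subset suggests contamination matters for that benchmark.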


This review was generated by AI and checked by a human editor.