QUICK REVIEW

[Paper Review] Language Models are Few-Shot Learners

T. B. Brown, Benjamin Mann|arXiv (Cornell University)|May 28, 2020

Topic Modeling | References: 127 | Citations: 3,027

One-Line Summary

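GPT-3, a 175B-parameter autoregressive language model, shows strong in-context (few-shot) learning across diverse NLP tasks without any gradient updates, and its performance scales with both model size and the number of demonstrations.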

ABSTRACT

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Motivation and Objectives

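Explore few-shot, one-shot, and zero-shot settings, motivated by removing the need for task-specific fine-tuning.
Evaluate how increasing model size and the amount of in-context information affects in-context learning across a range of NLP tasks.
Assess the limitations, data contamination risks, and societal impact of large language models.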

Proposed Method

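Train eight GPT-3 model sizes, from 125M to 175B parameters, using a Transformer that alternates dense and sparse attention patterns across its layers.
Pre-train on a mix of filtered and curated datasets (Common Crawl, WebText, Books, Wikipedia) for a total of 300B tokens.
Evaluate the zero-shot, one-shot, and few-shot settings by conditioning the model on natural-language prompts and demonstrations within a 2048-token context window.
Use beam search for free-form completions, and score with whichever metric suits the task: F1, BLEU, or exact match.
Investigate data contamination, report potential overlap with test sets, and flag where overlap may inflate results.
Where applicable, compare performance against state-of-the-art fine-tuned models.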
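
The prompt-conditioning step above can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_few_shot_prompt`, the `Q:`/`A:` layout, and the whitespace-based token count are all assumptions made for illustration. It shows how a single routine can cover the zero-shot (k = 0), one-shot (k = 1), and few-shot (k > 1) settings while stopping before the context window is exceeded.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    k: int,
    max_tokens: int = 2048,
) -> str:
    """Assemble a K-shot prompt: description, up to k demonstrations, then the query."""

    def count_tokens(text: str) -> int:
        # Assumption: whitespace splitting stands in for a real tokenizer.
        return len(text.split())

    header = task_description.strip()
    tail = f"Q: {query}\nA:"  # the model is expected to continue after "A:"
    chosen: List[str] = []
    used = count_tokens(header) + count_tokens(tail)

    for question, answer in demonstrations[:k]:
        block = f"Q: {question}\nA: {answer}"
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break  # stop adding demonstrations once the window is full
        chosen.append(block)
        used += cost

    parts = ([header] if header else []) + chosen + [tail]
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [("2+2", "4"), ("3+5", "8"), ("7+1", "8")]
    print(build_few_shot_prompt("Add the numbers.", demos, "1+1", k=2))
```

No weights are updated at any point: the demonstrations only change the text the model is conditioned on, which is what separates this setup from fine-tuning.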

Experimental Results

Research Questions

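RQ1: How does GPT-3 perform across a wide range of NLP tasks under zero-shot, one-shot, and few-shot conditions?
RQ2: Does increasing model size improve in-context learning efficiency and few-shot performance across tasks?
RQ3: What are the limitations and failure modes of in-context learning at scale?
RQ4: To what extent does data contamination in benchmark tasks affect the reported results?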

Key Findings

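GPT-3 shows strong few-shot performance on many NLP datasets, sometimes matching or exceeding fine-tuned state-of-the-art models.
Zero-shot performance improves steadily with model size, while few-shot performance improves more rapidly as both model size and the number of demonstrations grow.
In the few-shot setting, it can handle tasks that require on-the-fly reasoning (such as unscrambling words and 3-digit arithmetic) and can generate synthetic news articles that resemble human writing.
GPT-3 still struggles on some tasks in the few-shot setting, including certain NLI and reading comprehension benchmarks.
Data contamination has little effect on most datasets, but it may inflate results on a few benchmarks; for those, the authors either omit the results or flag them.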
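
The contamination finding rests on checking benchmark examples for overlap with the training data. This review does not describe the authors' exact procedure, so the following is only a rough sketch of one common approach, word-level n-gram overlap. The helper names `ngrams` and `split_clean_dirty` and the default `n = 13` are assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word-level n-grams in text."""
    words = text.lower().split()
    # range() is empty when the text has fewer than n words, giving an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def split_clean_dirty(
    examples: Iterable[str],
    training_docs: Iterable[str],
    n: int = 13,  # assumed default, not taken from this review
) -> Tuple[List[str], List[str]]:
    """Split benchmark examples into (clean, dirty) by n-gram overlap with training text."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    clean: List[str] = []
    dirty: List[str] = []
    for example in examples:
        # An example is "dirty" if any of its n-grams also appears in the training text.
        if ngrams(example, n) & train_grams:
            dirty.append(example)
        else:
            clean.append(example)
    return clean, dirty


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    bench = ["quick brown fox jumps", "a completely different sentence here"]
    clean, dirty = split_clean_dirty(bench, train, n=3)
    print("clean:", clean)
    print("dirty:", dirty)
```

Scoring the model separately on the clean subset and on the full benchmark gives a simple way to see whether overlap is inflating a result.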

This review was generated by AI and checked by a human editor.