QUICK REVIEW

[Paper Review] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Thomas J. Wang, Adam Roberts|arXiv (Cornell University)|Apr 12, 2022

Topic Modeling | Citations: 23

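One-Line Summary

This paper systematically compares combinations of large language model architectures and pretraining objectives. It shows that causal decoder-only models trained with full language modeling (FLM) generalize best zero-shot after unsupervised pretraining alone, that encoder-decoder models pretrained with masked language modeling (MLM) perform best after multitask finetuning, and that models can be adapted from one architecture to another.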

ABSTRACT

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.

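Motivation and Objectives

Evaluate how the architecture (causal decoder-only, non-causal decoder, encoder-decoder) affects zero-shot generalization after unsupervised pretraining.
Evaluate how the pretraining objective (full, prefix, and masked language modeling: FLM, PLM, MLM) affects zero-shot tasks across architectures.
Investigate whether multitask finetuning changes which architecture and objective are preferable for zero-shot generalization.
Explore adaptation across architectures and objectives as a way to transfer strengths efficiently.
Provide practical guidance for designing LLMs optimized for generative prompting versus multitask finetuning.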
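
The difference between causal and non-causal visibility can be illustrated with attention masks. The following is a minimal sketch, not the paper's implementation: the function names and the `prefix_len` parameter are hypothetical, and the encoder-decoder case (a bidirectional encoder plus a causal decoder with cross-attention) is not covered by a single mask like this.

```python
def attention_mask(seq_len, prefix_len=0):
    """Return mask[i][j] == True when position i may attend to position j.

    prefix_len=0 gives a fully causal mask (each position sees only itself
    and earlier positions). prefix_len>0 additionally lets every position
    see the whole prefix, i.e. non-causal visibility on the input.
    """
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw the mask as text: 'x' = visible, '.' = hidden."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


causal = attention_mask(4)                    # causal decoder
non_causal = attention_mask(4, prefix_len=2)  # non-causal decoder, 2-token input

print(render(causal))
print()
print(render(non_causal))
```

In the non-causal case, the first row gains an extra visible position: token 0 can already attend to token 1 because both belong to the input prefix.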

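Proposed Method

Systematically pretrain <architecture, objective> pairs at roughly 5B parameters (encoder-decoder, ED: 11B; causal decoder, CD: 4.8B) on 168B tokens.
Compare the FLM, PLM, and MLM objectives for each architecture, with and without multitask finetuning (MT-F).
Apply adaptation techniques: language modeling adaptation (LM-A; MLM → PLM/FLM) and non-causal MLM adaptation to convert between architecture types.
Run MT-F on a T0-style mixture for 13B tokens, and evaluate zero-shot performance on 30 tasks using the T0-Eval and EAI-Eval prompts.
Report results at intermediate and final checkpoints (42B, 84B, and 168B tokens).
Use two zero-shot benchmarks (T0-Eval and EAI-Eval) with consistent prompting across tasks.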
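
To make the three objectives concrete, here is a rough sketch of how one token sequence could be split into model inputs and prediction targets under each of them. It assumes a T5-style span-corruption form of MLM, which this review does not specify; the function names and the `<S0>`, `<S1>` sentinel tokens are hypothetical.

```python
def flm_example(tokens):
    # Full language modeling: no separate input, every token is a target.
    return {"inputs": [], "targets": list(tokens)}


def plm_example(tokens, prefix_len):
    # Prefix language modeling: the prefix is given as input and only the
    # continuation is predicted.
    return {"inputs": list(tokens[:prefix_len]), "targets": list(tokens[prefix_len:])}


def mlm_example(tokens, spans):
    # Masked language modeling (span-corruption variant, an assumption):
    # each masked span is replaced by a sentinel in the input, and the
    # target lists each sentinel followed by the tokens it hides.
    # `spans` must be sorted, non-overlapping (start, end) half-open pairs.
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<S{k}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return {"inputs": inputs, "targets": targets}


tokens = "the quick brown fox jumps over the lazy dog".split()
print(flm_example(tokens))
print(plm_example(tokens, prefix_len=4))
print(mlm_example(tokens, spans=[(1, 3), (6, 7)]))
```

Under this framing, LM-A amounts to continuing the training of an MLM-pretrained model on examples shaped like the PLM or FLM ones.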

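Experimental Results

Research Questions

RQ1: Which architecture-objective combination yields the strongest zero-shot generalization immediately after unsupervised pretraining?
RQ2: How does multitask finetuning change which architecture and/or objective is preferable for zero-shot generalization?
RQ3: Can gaps between architectures and objectives be closed efficiently through adaptation, without full retraining?
RQ4: Do different prompt and task benchmarks (T0-Eval vs. EAI-Eval) bias model rankings toward particular architectures?
RQ5: What practical guidance emerges for designing LLMs optimized for generative prompting versus multitask finetuning?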

Key Findings

Model	EAI-Eval	T0-Eval	Notes
Causal decoder	44.2	42.4	Best for EAI-Eval among FLM-trained after pretraining
Non-causal decoder	43.5	41.8	Second best on EAI-Eval after FLM/PLM post-pretraining
Encoder-decoder	39.9	41.7	Strong baseline; encoder-decoder MLM excels after MT-F
Random baseline	32.9	41.7	Random performance baseline for reference
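
One way to read the table is to subtract the random baseline from each score. The short sketch below does that arithmetic using only the numbers listed above; the variable and function names are illustrative.

```python
# (EAI-Eval, T0-Eval) scores copied from the table above.
scores = {
    "Causal decoder": (44.2, 42.4),
    "Non-causal decoder": (43.5, 41.8),
    "Encoder-decoder": (39.9, 41.7),
}
random_baseline = (32.9, 41.7)


def margins(scores, baseline):
    """Return each model's margin over the random baseline per benchmark."""
    return {
        name: tuple(round(s - b, 1) for s, b in zip(vals, baseline))
        for name, vals in scores.items()
    }


for name, (eai, t0) in margins(scores, random_baseline).items():
    print(f"{name}: EAI-Eval +{eai}, T0-Eval +{t0}")
```

On these numbers the margins on EAI-Eval are several points, while on T0-Eval every model sits within about one point of the random baseline.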

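After unsupervised pretraining alone, causal decoder-only models trained with full language modeling achieve the best zero-shot generalization across both benchmarks.
After multitask finetuning, encoder-decoder models pretrained with MLM outperform the other configurations, which indicates that MT-F shifts the preference toward encoder-decoder architectures and MLM. On some benchmarks, the non-causal decoder with MLM follows closely behind.
Adaptation speeds up convergence and enables effective transfer across architectures. For example, an MLM-pretrained non-causal decoder can be adapted into a performant causal decoder through language modeling adaptation, and a causal decoder can be adapted into a non-causal decoder that reaches competitive performance after MT-F.
Prompts and task sets affect zero-shot performance: EAI-Eval prompts generally yield higher performance than the average T0 prompt, and the differences between architectures are task dependent.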

This review was generated by AI and checked by a human editor.