QUICK REVIEW

[Paper Review] What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Thomas J. Wang, Adam Roberts|arXiv (Cornell University)|Apr 12, 2022

Topic Modeling | Cited by 23

One-Sentence Summary

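This paper systematically compares combinations of large language model architectures and pretraining objectives. Causal decoder-only models pretrained with full language modeling give the best zero-shot results before multitask finetuning, whereas encoder-decoder models pretrained with MLM give the best results after it. The paper also demonstrates adaptation paths between architectures.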

ABSTRACT

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at https://github.com/bigscience-workshop/architecture-objective.

Research Motivation and Goals

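Assess how the architecture (causal decoder-only, non-causal decoder, encoder-decoder) affects zero-shot generalization after unsupervised pretraining.
Assess how the pretraining objective (FLM: full language modeling; PLM: prefix language modeling; MLM: masked language modeling) affects zero-shot performance on each architecture.
Study whether multitask finetuning changes which architecture and objective are preferred for zero-shot generalization.
Explore adaptation between architectures and objectives to transfer their strengths efficiently.
Provide actionable guidance for designing LLMs optimized for generative prompting and multitask finetuning.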
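
The difference between the causal and non-causal decoder comes down to which positions each token may attend to. The following is a minimal sketch of the two visibility patterns, not the paper's implementation; the function names and the `prefix_len` parameter are illustrative assumptions. In an encoder-decoder, by comparison, the encoder attends bidirectionally over the input while the decoder stays causal over the targets and reads the encoder output through cross-attention.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j.

    Causal decoder: every position sees only itself and earlier positions.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def non_causal_mask(seq_len, prefix_len):
    """Non-causal (prefix) decoder: the first `prefix_len` positions (the
    input) are visible to every position, including to each other; the
    remaining positions keep the causal pattern.
    """
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(non_causal_mask(4, prefix_len=2)))
```

With `prefix_len=0` the non-causal mask reduces to the causal one, which is one way to see why the two decoder variants can share weights and be adapted into each other.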

Proposed Method

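Systematically pretrain six <architecture, objective> pairs at roughly 5B parameters (encoder-decoder, ED: 11B; causal decoder, CD: 4.8B) on 168B tokens.
Compare the FLM, PLM, and MLM objectives across the architectures, evaluating each with and without multitask finetuning (MT-F).
Apply adaptation techniques: language modeling adaptation (LM-A, MLM → PLM/FLM) and non-causal MLM adaptation, which convert a pretrained model from one objective or architecture type to another.
Run MT-F on a 13B-token T0-style mixture, then evaluate zero-shot on 30 tasks using prompts from T0-Eval and EAI-Eval.
Report results at key checkpoints: 42B, 84B, and 168B tokens.
Use two zero-shot benchmarks (T0-Eval and EAI-Eval), keeping the prompts for each task consistent.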
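
To make the three objectives concrete, here is a rough sketch of how a single token sequence could be turned into an (input, target) pair under FLM, PLM, and MLM. It assumes a T5-style span-corruption form of MLM; the helper names, the `<extra_id_k>` sentinel strings, and the hand-picked spans are illustrative assumptions rather than the paper's data pipeline, and a real pipeline would sample spans randomly and work on token IDs.

```python
def flm_example(tokens):
    """Full language modeling: predict every token from the tokens before it."""
    return tokens[:-1], tokens[1:]


def plm_example(tokens, prefix_len):
    """Prefix language modeling: the prefix is input only; the loss is
    computed on the remaining tokens."""
    return tokens[:prefix_len], tokens[prefix_len:]


def mlm_example(tokens, spans):
    """Span-corruption MLM (assumed T5-style): each (start, end) span is
    replaced by a sentinel in the input, and the target lists each sentinel
    followed by the tokens it replaced. Spans must be sorted and disjoint."""
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"  # illustrative sentinel name
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        cursor = end
    inputs += tokens[cursor:]
    return inputs, targets


if __name__ == "__main__":
    toks = ["the", "cat", "sat", "on", "the", "mat"]
    print(flm_example(toks))
    print(plm_example(toks, prefix_len=3))
    print(mlm_example(toks, spans=[(1, 2), (4, 5)]))
```

Under these assumptions, FLM supervises almost every position, PLM supervises only the part after the prefix, and MLM supervises only the masked spans, so the three objectives extract different amounts of training signal from the same sequence.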

Experimental Results

Research Questions

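RQ1: Which architecture-objective combinations show the strongest zero-shot generalization immediately after unsupervised pretraining?
RQ2: How does multitask finetuning change which architecture and/or objective is preferred for zero-shot generalization?
RQ3: Can adaptation efficiently bridge the gap between architectures and objectives without full retraining?
RQ4: Do the different prompt and task benchmarks (T0-Eval vs. EAI-Eval) bias model rankings toward certain architectures?
RQ5: What practical guidance follows for building LLMs optimized for generative prompting and multitask finetuning?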

Main Findings

Model	EAI-Eval	T0-Eval	Notes
Causal decoder	44.2	42.4	Highest EAI-Eval score after pretraining (FLM objective)
Non-causal decoder	43.5	41.8	Second-highest EAI-Eval score after pretraining (FLM/PLM objective)
Encoder-decoder	39.9	41.7	Lowest of the three after pretraining; encoder-decoder with MLM excels after MT-F
Random baseline	32.9	41.7	Chance-level performance, for reference
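
The scores are easier to read as margins over the random baseline. The short sketch below only does that subtraction on the numbers in the table above; the dictionary layout and function name are illustrative assumptions.

```python
# (EAI-Eval, T0-Eval) scores copied from the table above.
SCORES = {
    "Causal decoder": (44.2, 42.4),
    "Non-causal decoder": (43.5, 41.8),
    "Encoder-decoder": (39.9, 41.7),
}
RANDOM_BASELINE = (32.9, 41.7)


def margin_over_random(scores, baseline):
    """Return each model's score minus the random baseline, per benchmark,
    rounded to one decimal place to match the table's precision."""
    return {
        name: (round(eai - baseline[0], 1), round(t0 - baseline[1], 1))
        for name, (eai, t0) in scores.items()
    }


if __name__ == "__main__":
    for name, (eai, t0) in margin_over_random(SCORES, RANDOM_BASELINE).items():
        print(f"{name}: EAI-Eval +{eai}, T0-Eval +{t0}")
```

Read this way, the EAI-Eval margins range from 7.0 to 11.3 points, while every T0-Eval margin in this table is below one point, so the architecture ranking here rests almost entirely on EAI-Eval.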

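After unsupervised pretraining alone, causal decoder models trained with full language modeling show the best zero-shot generalization on both benchmarks.
After multitask finetuning, encoder-decoder models pretrained with MLM outperform the other configurations, which shows that MT-F shifts the preference toward encoder-decoder MLM; non-causal decoders pretrained with MLM come close to the lead on some benchmarks.
Adaptation speeds up convergence and enables effective transfer across architectures: a non-causal decoder pretrained with MLM can be adapted into a performant generative causal decoder through language modeling adaptation, and a pretrained causal decoder can be efficiently adapted into a non-causal decoder that is competitive after MT-F.
The choice of prompts and task set affects zero-shot performance: EAI-Eval prompts generally yield higher scores than the average T0 prompt, and the gaps between architectures vary by task.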

This review was generated by AI and reviewed by a human editor.