QUICK REVIEW

[Paper Review] Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Yi Tay, Mostafa Dehghani|arXiv (Cornell University)|Sep 22, 2021

Topic Modeling | References: 50 | Cited by: 58

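One-Sentence Summary

This paper empirically studies how Transformers scale in pretraining and fine-tuning. It shows that model shape affects downstream transfer and that scaling effects differ across compute regions, and it proposes DeepNarrow scaling to obtain Pareto-efficient models with fewer parameters and faster training.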

ABSTRACT

There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which have both financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. presents a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear if these set of findings transfer to downstream task within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from only the model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.

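Research Motivation and Objectives

Assess how upstream pretraining scale relates to downstream transfer performance.
Study how model shape (depth versus width) affects fine-tuning results across tasks.
Characterize scaling behaviour across compute regions and model sizes.
Identify scaling strategies for Transformer transfer learning that are both practical and Pareto-efficient.
Provide pretrained checkpoints and tools to support future scaling research.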

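Proposed Method

Use encoder-decoder Transformers based on the T5 architecture with relative attention, covering a wide range of sizes from tiny to XXXL.
Pretrain on the Colossal Cleaned Common Crawl (C4) with a span-based MLM objective for 2^19 steps on TPU-v3 hardware.
Fine-tune on 17 downstream tasks, including GLUE, SuperGLUE, and SQuAD, and report aggregate accuracy on SuperGLUE.
Systematically vary the scaling operators (depth, width, hidden dimension, KV, heads, etc.) and measure both upstream perplexity and downstream transfer.
Analyze the Pareto frontier of the configurations to assess efficiency in terms of parameters, FLOPs, and throughput.
Publicly release more than 100 pretrained checkpoints and run a cross-domain check on Vision Transformers (ViT).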
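To make the scaling operators above more concrete, the following is a minimal sketch of how a model's shape translates into an approximate parameter count for a T5-style encoder-decoder. It assumes the original T5 layout (a non-gated feed-forward block, a shared embedding table of 32128 tokens, and equal encoder and decoder depth) and ignores layer norms and relative position biases. The `T5Shape` class, the `approx_params` function, and the `deep_narrow` configuration are hypothetical names introduced for illustration; they are not taken from the paper or from any released checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Shape:
    """Shape knobs of a T5-style model (illustrative, not the paper's code)."""
    num_layers: int          # layers in the encoder and, by assumption, the decoder
    d_model: int             # model (embedding) width
    d_ff: int                # feed-forward hidden width
    num_heads: int           # attention heads
    d_kv: int                # per-head key/value width
    vocab_size: int = 32128  # vocabulary size used by the public T5 configs


def approx_params(shape: T5Shape) -> int:
    """Approximate parameter count, ignoring layer norms and position biases."""
    inner = shape.num_heads * shape.d_kv
    attention = 4 * shape.d_model * inner          # Q, K, V and output projections
    feed_forward = 2 * shape.d_model * shape.d_ff  # two dense matrices (non-gated)
    encoder = shape.num_layers * (attention + feed_forward)
    # Each decoder layer has self-attention plus cross-attention.
    decoder = shape.num_layers * (2 * attention + feed_forward)
    embeddings = shape.vocab_size * shape.d_model  # shared input/output table
    return encoder + decoder + embeddings


# Standard T5-base shape.
t5_base = T5Shape(num_layers=12, d_model=768, d_ff=3072, num_heads=12, d_kv=64)
# Hypothetical deeper, narrower shape: T5-small widths with 16 layers.
deep_narrow = T5Shape(num_layers=16, d_model=512, d_ff=2048, num_heads=8, d_kv=64)

for name, shape in [("t5_base", t5_base), ("deep_narrow", deep_narrow)]:
    print(f"{name}: {approx_params(shape) / 1e6:.1f}M parameters")
```

Under these assumptions the deeper, narrower shape comes out at roughly 134M parameters against roughly 223M for the T5-base shape. This only illustrates how depth can be traded against width at a lower parameter budget; it says nothing about downstream quality, which the paper measures empirically.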

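Experimental Results

Research Questions

RQ1: Does the scaling behaviour observed in upstream pretraining generalize to downstream transfer in the pretrain-finetune setting?
RQ2: How does model shape (depth vs. width) affect downstream transfer performance across tasks?
RQ3: Do scaling strategies yield the same efficiency across compute regions (small scale vs. large scale) and across modalities?
RQ4: Can we derive practical scaling protocols that improve Pareto efficiency without sacrificing downstream quality?
RQ5: Do the findings hold consistently across natural language processing tasks, and do they transfer to vision models such as ViT?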

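Key Findings

Downstream transfer performance depends strongly on model shape, not just on parameter count, in contrast to the upstream trend.
Pretraining perplexity is often a misleading predictor of downstream quality; upstream gains do not always translate into improvements on downstream tasks.
Canonical sizes such as T5-base and T5-large are less Pareto-efficient than carefully chosen alternative configurations.
Scaling effects differ across compute regions; a strategy that works at small scale may not carry over to larger compute regions.
DeepNarrow scaling (scale depth first, then width) yields Pareto-efficient models that match or exceed downstream performance while using fewer parameters and training faster; the approach also transfers to ViT and to additional NLP tasks beyond GLUE/SuperGLUE/SQuAD.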
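The notion of Pareto efficiency used in these findings can be sketched in a few lines: a configuration is on the Pareto frontier if no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. The example below is a minimal sketch under that definition. The `Config` type, the `pareto_front` function, and every number in `configs` are made up for illustration and are not results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str
    params_m: float  # cost axis: parameters in millions (lower is better)
    accuracy: float  # quality axis: downstream accuracy (higher is better)


def pareto_front(configs: List[Config]) -> List[Config]:
    """Return the configurations that no other configuration dominates."""
    front = []
    for c in configs:
        dominated = any(
            o.params_m <= c.params_m
            and o.accuracy >= c.accuracy
            and (o.params_m < c.params_m or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.params_m)


# Made-up numbers, chosen only to exercise the function.
configs = [
    Config("wide-shallow", 220.0, 70.0),
    Config("deep-narrow", 130.0, 70.5),
    Config("tiny", 20.0, 60.0),
    Config("large", 740.0, 78.0),
]

front_names = [c.name for c in pareto_front(configs)]
print(front_names)
```

In this toy data the "wide-shallow" entry is dominated because "deep-narrow" is both smaller and slightly more accurate, which is the same pattern the summary describes when it calls some canonical sizes Pareto-inefficient. The same function applies unchanged if the cost axis is FLOPs or training time instead of parameters.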


This review was generated by AI and reviewed by human editors.