QUICK REVIEW

[Paper Review] Scaling Laws for Transfer

Danny Hernandez, Jared Kaplan|arXiv (Cornell University)|Feb 2, 2021

Topic Modeling | References: 26 | Cited by: 26

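One-Sentence Summary

This paper derives empirical scaling laws for transfer between distributions in an unsupervised fine-tuning setting, introduces the effective data transferred, D_T, and shows that D_T follows a power law in model size and fine-tuning dataset size across orders of magnitude.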

ABSTRACT

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.

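Motivation and Goals

Characterize transfer between distributions in an unsupervised fine-tuning setting.
Quantify how pre-training improves data efficiency through the effective data transferred, D_T.
Identify the power-law relationship linking model size, fine-tuning data, and transferred data.
Assess when pre-training helps or hurts across data regimes (ossification).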

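Proposed Method

Train transformer models across a wide range of model sizes (4 orders of magnitude) and data regimes: from scratch, pre-trained on language and then fine-tuned on code, and with mixed pre-training.
Define and compute D_T, the effective data transferred: the additional data a from-scratch model of the same size would need to reach the same downstream loss.
Fit D_T to the power-law form D_T = k (D_F)^{alpha} (N)^{beta}, and analyze how alpha, beta, and k vary with the distributions.
Use cross-entropy loss L to evaluate performance and to distinguish the low-data and high-data regimes (D_F relative to D(N)).
Compare text-to-code transfer against transfer from mixed text/code pre-training, and assess how pre-training affects ossification and compute efficiency.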
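The fitting step above can be sketched as an ordinary least-squares regression in log space, since log D_T = log k + alpha log D_F + beta log N is linear in the unknowns. This is a minimal sketch, not the authors' fitting procedure: the function name `fit_transfer_power_law`, the grid of sizes, and the noise level are all hypothetical, and the data are synthetic values generated from the power law itself.

```python
import numpy as np

def fit_transfer_power_law(d_f, n, d_t):
    """Least-squares fit of log D_T = log k + alpha*log D_F + beta*log N.

    Hypothetical helper for illustration; returns (k, alpha, beta).
    """
    d_f, n, d_t = (np.asarray(x, dtype=float) for x in (d_f, n, d_t))
    # Design matrix columns: intercept (log k), log D_F, log N.
    design = np.column_stack([np.ones_like(d_f), np.log(d_f), np.log(n)])
    coeffs, *_ = np.linalg.lstsq(design, np.log(d_t), rcond=None)
    log_k, alpha, beta = coeffs
    return float(np.exp(log_k)), float(alpha), float(beta)

# Synthetic data only (an assumption, not measurements from the paper):
# a grid of fine-tuning set sizes and parameter counts with log-normal noise.
rng = np.random.default_rng(0)
d_f_grid, n_grid = np.meshgrid(np.logspace(5, 8, 7), np.logspace(5, 9, 9))
d_f_flat, n_flat = d_f_grid.ravel(), n_grid.ravel()
true_k, true_alpha, true_beta = 1.9e4, 0.18, 0.38
noise = rng.normal(0.0, 0.05, size=d_f_flat.shape)
d_t_synth = true_k * d_f_flat**true_alpha * n_flat**true_beta * np.exp(noise)

k_hat, alpha_hat, beta_hat = fit_transfer_power_law(d_f_flat, n_flat, d_t_synth)
print(f"k={k_hat:.3g}, alpha={alpha_hat:.3f}, beta={beta_hat:.3f}")
```

Fitting in log space treats relative errors in D_T uniformly across orders of magnitude, which suits data spanning several decades. The summary states that the power law holds in the low-data regime, so a fit like this would only be meaningful on points from that regime.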

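Experimental Results

Research Questions

RQ1: How does the effective data transferred, D_T, scale with model size N and fine-tuning data D_F?
RQ2: Do the transfer coefficients (k, alpha, beta) depend on the source and target distributions, and what do they imply about the proximity of distributions?
RQ3: In the low-data regime, how does pre-training shift the data-efficiency and compute-efficiency frontiers?
RQ4: In larger-data regimes, can pre-training hurt fine-tuning performance (ossification)?
RQ5: What do these scaling laws mean in practice for choosing pre-training data composition and model size?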

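Key Findings

In the low-data regime, D_T follows a power law: D_T = k (D_F)^{alpha} (N)^{beta}.
For text-to-Python transfer, beta is about 0.38, alpha about 0.18, and k about 1.9e4; for pre-training on 50% text and 50% non-Python code, beta is about 0.38, alpha about 0.096, and k about 2.1e5.
In the low-data regime, pre-training effectively multiplies the fine-tuning dataset, improving data efficiency and making fine-tuning more compute-efficient.
In the high-data regime, especially when a small model is fine-tuned on a very large downstream dataset, pre-training can hurt adaptability (ossification).
The transfer coefficients provide a cheap, directed measure of proximity between distributions and can inform the trade-off between collecting more fine-tuning data and increasing model size.
Compared with training from scratch, fine-tuning is generally more compute-efficient in the low-data regime, although this advantage shrinks as downstream data grows.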
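To make the reported coefficients concrete, the sketch below plugs them into D_T = k (D_F)^{alpha} (N)^{beta} and compares what the exponents imply for scaling the model versus collecting more fine-tuning data. The function name, the example model size, and the example dataset size are hypothetical. The units of D_F and D_T are not stated in this summary and are left unspecified, and the multiplier assumes that total effective data equals D_F + D_T.

```python
def effective_data_transferred(d_f, n, k, alpha, beta):
    """Evaluate D_T = k * D_F**alpha * N**beta (low-data regime only)."""
    return k * d_f**alpha * n**beta

# Coefficients as reported in the findings above.
TEXT_TO_PYTHON = {"k": 1.9e4, "alpha": 0.18, "beta": 0.38}
MIXED_TO_PYTHON = {"k": 2.1e5, "alpha": 0.096, "beta": 0.38}

# Hypothetical operating point, chosen only for illustration.
n_params = 4e7  # model parameter count
d_f = 3e5       # fine-tuning dataset size (units as used for the fit)

d_t_text = effective_data_transferred(d_f, n_params, **TEXT_TO_PYTHON)
d_t_mixed = effective_data_transferred(d_f, n_params, **MIXED_TO_PYTHON)

# Effective multiplier, assuming total effective data is D_F + D_T.
multiplier_text = (d_f + d_t_text) / d_f

# Because D_T is a pure power law, a 10x change in one input scales D_T
# by 10**exponent regardless of the operating point.
gain_from_10x_model = 10 ** TEXT_TO_PYTHON["beta"]
gain_from_10x_data = 10 ** TEXT_TO_PYTHON["alpha"]

print(f"D_T (text -> Python):  {d_t_text:.3g}")
print(f"D_T (mixed -> Python): {d_t_mixed:.3g}")
print(f"Effective multiplier (text): {multiplier_text:.3g}")
print(f"D_T gain from 10x model size: {gain_from_10x_model:.2f}x")
print(f"D_T gain from 10x fine-tuning data: {gain_from_10x_data:.2f}x")
```

With beta larger than alpha, a tenfold increase in model size raises D_T more than a tenfold increase in fine-tuning data, which is the kind of trade-off the transfer coefficients are meant to inform.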


This review was generated by AI and reviewed by a human editor.