[Paper Review] Scaling Laws for Transfer
The paper derives empirical scaling laws for transfer learning between distributions in unsupervised fine-tuning. It introduces the effective data transferred, D_T, and shows that in the low-data regime D_T follows a power law in model size and fine-tuning dataset size.
We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited and stop improving in performance (cross-entropy loss). When we do the same for models pre-trained on a large language dataset, the slope in performance gains is merely reduced rather than going to zero. We calculate the effective data "transferred" from pre-training by determining how much data a transformer of the same size would have required to achieve the same loss when training from scratch. In other words, we focus on units of data while holding everything else fixed. We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size. We believe the exponents in these power-laws correspond to measures of the generality of a model and proximity of distributions (in a directed rather than symmetric sense). We find that pre-training effectively multiplies the fine-tuning dataset size. Transfer, like overall performance, scales predictably in terms of parameters, data, and compute.
Motivation & Objective
- Characterize transfer between distributions in unsupervised fine-tuning settings.
- Quantify how pre-training affects data efficiency via an effective data transfer metric D_T.
- Identify power-law relationships linking model size, fine-tuning data, and transferred data.
- Assess when pre-training helps (the low-data regime) and when it can instead harm fine-tuning performance (ossification).
Proposed method
- Train transformer models spanning 4 orders of magnitude in size under three setups: training from scratch on code, pre-training on language then fine-tuning on code, and pre-training on a text/code mixture then fine-tuning.
- Define and compute D_T, the effective data transferred, as the additional downstream data (beyond the fine-tuning set D_F) that a from-scratch model of the same size would need to reach the same loss on the downstream task.
- Fit D_T to a power-law form D_T = k (D_F)^{alpha} (N)^{beta}, and analyze how alpha, beta, and k vary with distributions.
- Use cross-entropy loss L to evaluate performance and determine low-data vs high-data regimes (D_F relative to D(N)).
- Compare transfer from text to code and from mixed text/code pre-training, and assess the impact of pre-training on ossification and compute efficiency.
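The definition of D_T above can be sketched in code. This is a minimal illustration, not the authors' procedure: it assumes D_T is counted as data beyond D_F (so D_T = D_E - D_F, where D_E is the from-scratch dataset size that matches the fine-tuned loss), and it inverts the from-scratch loss curve by linear interpolation in log-log space. The function name, the loss curve, and all numbers are hypothetical.

```python
import numpy as np


def effective_data_transferred(d_scratch, loss_scratch, loss_finetuned, d_finetune):
    """Return D_T = D_E - D_F for a single model size.

    D_E is the from-scratch dataset size whose loss equals the fine-tuned
    model's loss, found by interpolating log(D) against log(L).
    Assumption: loss decreases monotonically with dataset size.
    """
    d_scratch = np.asarray(d_scratch, dtype=float)
    loss_scratch = np.asarray(loss_scratch, dtype=float)
    # np.interp requires increasing x values, so sort the curve by loss.
    order = np.argsort(loss_scratch)
    log_d_effective = np.interp(
        np.log(loss_finetuned),
        np.log(loss_scratch[order]),
        np.log(d_scratch[order]),
    )
    # Note: np.interp clamps to the end points outside the measured range.
    return float(np.exp(log_d_effective)) - d_finetune


# Hypothetical from-scratch curve for one model size: L(D) = 100 * D**-0.25
d_grid = np.logspace(5, 9, 9)
loss_grid = 100.0 * d_grid ** -0.25

# Hypothetical fine-tuned run: D_F = 1e6, reaching the loss that a
# from-scratch model would only reach with 1e8 units of data.
d_f = 1e6
loss_ft = 100.0 * (1e8) ** -0.25
d_t = effective_data_transferred(d_grid, loss_grid, loss_ft, d_f)
print(f"D_T = {d_t:.3g}")
```

Interpolating in log-log space is a design choice for this sketch: it is exact when the from-scratch curve is itself a power law, and only approximate otherwise.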
Experimental results
Research questions
- RQ1: How does the amount of effective data transferred D_T scale with model size N and fine-tuning data D_F?
- RQ2: Do transfer coefficients (k, alpha, beta) depend on the source and target distributions, and what do they imply about distribution proximity?
- RQ3: Under low-data conditions, how does pre-training affect the data efficiency and the compute-efficiency frontier?
- RQ4: Can pre-training ever harm fine-tuning performance (ossification) at larger data regimes?
- RQ5: What are the practical implications of these scaling laws for choosing pre-training data compositions and model sizes?
Key findings
- D_T follows a power-law in the low-data regime: D_T = k (D_F)^{alpha} (N)^{beta}.
- In text-to-Python transfer, beta ≈ 0.38 and alpha ≈ 0.18, with k ≈ 1.9e4; with 50% text and 50% non-Python code, beta ≈ 0.38, alpha ≈ 0.096, and k ≈ 2.1e5.
- Pre-training effectively multiplies the fine-tuning dataset in the low-data regime, enhancing data efficiency and enabling better compute efficiency for fine-tuning.
- Ossification can occur in the high-data regime, where pre-training harms rather than helps adaptation; it shows up mainly for small models fine-tuned on very large downstream datasets.
- Transfer coefficients provide a cheap, directional measure of distribution proximity and can guide trade-offs between collecting fine-tuning data and increasing model size.
- Fine-tuning generally remains compute-efficient in the low-data regime compared to training from scratch, though this advantage diminishes as downstream data grows.
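To make the power law in the key findings concrete, the sketch below evaluates D_T = k (D_F)^{alpha} (N)^{beta} with the text-to-Python coefficients quoted above, and then checks that a log-linear least-squares fit recovers those coefficients from synthetic, noise-free data. The function names, the grids of D_F and N, and the multiplier formula (D_F + D_T) / D_F are assumptions made for illustration; this review does not state the units of D_F and N, so the inputs should be read as being in whatever units the paper's fit used.

```python
import numpy as np

# Text -> Python coefficients as quoted in the key findings above.
K, ALPHA, BETA = 1.9e4, 0.18, 0.38


def d_transferred(d_f, n, k=K, alpha=ALPHA, beta=BETA):
    """Evaluate D_T = k * D_F**alpha * N**beta (low-data regime only)."""
    return k * d_f ** alpha * n ** beta


def data_multiplier(d_f, n, **coeffs):
    """Assumed definition: total effective data divided by fine-tuning data."""
    return (d_f + d_transferred(d_f, n, **coeffs)) / d_f


def fit_power_law(d_f, n, d_t):
    """Fit log D_T = log k + alpha * log D_F + beta * log N by least squares."""
    design = np.column_stack([np.ones_like(d_f), np.log(d_f), np.log(n)])
    coef, *_ = np.linalg.lstsq(design, np.log(d_t), rcond=None)
    return float(np.exp(coef[0])), float(coef[1]), float(coef[2])


# Hypothetical grid of fine-tuning dataset sizes and model sizes.
d_f_grid, n_grid = np.meshgrid(np.logspace(5, 8, 4), np.logspace(6, 9, 4))
d_f_flat, n_flat = d_f_grid.ravel(), n_grid.ravel()

# Synthetic "measurements" generated from the quoted fit itself.
d_t_flat = d_transferred(d_f_flat, n_flat)
k_hat, alpha_hat, beta_hat = fit_power_law(d_f_flat, n_flat, d_t_flat)
print(f"k = {k_hat:.3g}, alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")

# The multiplier shrinks as D_F grows, because alpha < 1.
print(data_multiplier(1e5, 1e8), data_multiplier(1e8, 1e8))
```

Because alpha is below 1, D_T grows more slowly than D_F, which is consistent with the finding that the advantage of pre-training diminishes as downstream data grows. Real measurements would be noisy and restricted to the low-data regime, so a fit on actual runs would not recover the exponents this cleanly.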
This review was created by AI and reviewed by human editors.