[Paper Review] Scaling Laws for Deep Learning
This thesis demonstrates that deep learning training and pruning follow predictable scaling laws across vision and language tasks. It offers a constructive framework for predicting performance from small-scale measurements, and it proposes a conjectural direction, Nyquist learners, for reaching the generalization error lower limit at finite dataset size.
Running faster will only get you so far -- it is generally advisable to first understand where the roads lead, then get a car ... The renaissance of machine learning (ML) and deep learning (DL) over the last decade is accompanied by an unscalable computational cost, limiting its advancement and weighing on the field in practice. In this thesis we take a systematic approach to address the algorithmic and methodological limitations at the root of these costs. We first demonstrate that DL training and pruning are predictable and governed by scaling laws -- for state of the art models and tasks, spanning image classification and language modeling, as well as for state of the art model compression via iterative pruning. Predictability, via the establishment of these scaling laws, provides the path for principled design and trade-off reasoning, currently largely lacking in the field. We then continue to analyze the sources of the scaling laws, offering an approximation-theoretic view and showing through the exploration of a noiseless realizable case that DL is in fact dominated by error sources very far from the lower error limit. We conclude by building on the gained theoretical understanding of the scaling laws' origins. We present a conjectural path to eliminate one of the current dominant error sources -- through a data bandwidth limiting hypothesis and the introduction of Nyquist learners -- which can, in principle, reach the generalization error lower limit (e.g. 0 in the noiseless case), at finite dataset size.
Motivation & Objective
- Understand how generalization error scales with data size and model capacity across state-of-the-art tasks.
- Develop a constructive, predictive law for model performance from small-scale measurements.
- Extend scaling analysis to pruning and compression to inform deployment decisions.
- Investigate the origins of scaling laws through an approximation-based viewpoint.
- Propose future directions toward reducing error via data bandwidth limits and Nyquist learners.
Proposed method
- Empirically characterize generalization error across diverse datasets (vision and language) and model scales.
- Fit a joint functional form (scaling law) describing error as a function of data size and model size.
- Extend the scaling framework to Iterative Magnitude Pruning (IMP) to model pruned networks.
- Analyze error sources within an approximation-theoretic framework (realizability, uncertainty, learning deficiency, noise).
- Construct a realizable teacher-student setup to isolate error sources and test predictions.
- Propose theoretical pathways (data bandwidth limit, Nyquist learners) to approach the lower generalization error bound.
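The fitting step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's procedure: the review does not give the functional form, so the additive power law `err(n, m) = a * n**(-alpha) + b * m**(-beta) + c` is assumed here, and the names `predict_error` and `fit_joint_law`, the exponent grid, and the noiseless synthetic measurements are all hypothetical.

```python
import numpy as np

def predict_error(n, m, params):
    """Assumed joint law: error as a function of data size n and model size m."""
    a, alpha, b, beta, c = params
    return a * n ** (-alpha) + b * m ** (-beta) + c

def fit_joint_law(n, m, err, exponents=None):
    """Grid-search the two exponents; for each pair the law is linear in
    (a, b, c), so those coefficients come from ordinary least squares."""
    if exponents is None:
        exponents = np.round(np.linspace(0.05, 1.0, 96), 2)  # hypothetical grid
    best_sse, best_params = np.inf, None
    for alpha in exponents:
        for beta in exponents:
            X = np.column_stack([n ** (-alpha), m ** (-beta), np.ones_like(n)])
            coef, *_ = np.linalg.lstsq(X, err, rcond=None)
            sse = float(np.sum((X @ coef - err) ** 2))
            if sse < best_sse:
                best_sse = sse
                best_params = (coef[0], alpha, coef[1], beta, coef[2])
    return best_params

# Synthetic small-scale "measurements" generated from made-up parameters.
true_params = (2.0, 0.35, 5.0, 0.5, 0.05)
n_grid, m_grid = np.meshgrid([1e3, 3e3, 1e4, 3e4], [1e4, 3e4, 1e5, 3e5])
n_small, m_small = n_grid.ravel(), m_grid.ravel()
err_small = predict_error(n_small, m_small, true_params)

# Fit on the small configurations, then extrapolate to a much larger one.
fitted = fit_joint_law(n_small, m_small, err_small)
extrapolated = predict_error(1e6, 1e7, fitted)
print("fitted exponents:", fitted[1], fitted[3])
print("extrapolated error at n=1e6, m=1e7:", extrapolated)
```

Separating the exponents from the linear coefficients keeps the sketch dependency-light and deterministic. With real, noisy measurements a nonlinear least-squares routine and a held-out large-scale check would be the more natural choice.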
Experimental results
Research questions
- RQ1: What is the functional relationship between generalization error, data size, and model capacity in state-of-the-art models?
- RQ2: Can a constructive, predictive scaling law specify the exact model configuration needed to attain a target error at different data scales?
- RQ3: How does pruning (IMP) affect generalization error, and can a joint scaling law describe all pruned network configurations?
- RQ4: Which sources of error dominate deep learning generalization, and how do they influence scaling behavior?
- RQ5: What theoretical conditions could enable reaching near-optimal generalization with finite data (Nyquist learners)?
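To make RQ2 concrete, here is a hedged sketch of how a fitted law could be inverted to ask "how large a model is needed for a target error at a given data size?". The additive power-law form, the function name `required_model_size`, and every parameter value below are illustrative assumptions, not results from the thesis.

```python
def required_model_size(target_err, n, a, alpha, b, beta, c):
    """Smallest model size m with a*n**(-alpha) + b*m**(-beta) + c <= target_err
    under the assumed law, or None if data size n alone makes it unreachable."""
    data_term = a * n ** (-alpha)
    budget = target_err - c - data_term  # error left over for the model term
    if budget <= 0:
        return None  # more data (or a lower floor c) is needed, not a bigger model
    return (b / budget) ** (1.0 / beta)

# Hypothetical fitted parameters, for illustration only.
law = dict(a=2.0, alpha=0.35, b=5.0, beta=0.5, c=0.05)

m_needed = required_model_size(0.08, n=1e6, **law)
unreachable = required_model_size(0.06, n=1e6, **law)
print("model size for error 0.08 at n=1e6:", m_needed)
print("model size for error 0.06 at n=1e6:", unreachable)
```

The `None` branch reflects the trade-off reasoning the review describes: below a certain error, growing the model cannot compensate for a limited dataset.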
Key findings
- A joint scaling law accurately describes generalization error as a function of both data size and model size across vision and language tasks.
- Iterative magnitude pruning follows a predictable scaling law, and an invariant allows depth, width, and pruning density to be traded against one another while preserving error.
- An approximation-centric view identifies uncertainty and learning deficiencies as dominant error sources over realizability in studied regimes.
- Even in a noiseless, realizable teacher-student setup, error remains far from the lower limit, which shows that realizability is not the main driver of error and strengthens the case for other dominant error sources.
- A conjectural path toward Nyquist learners suggests that limiting data bandwidth could, in principle, allow the generalization error lower limit (e.g. 0 in the noiseless case) to be reached at finite dataset size.
This review was created by AI and reviewed by human editors.