[Paper Review] Triple descent and the two kinds of overfitting: Where & why do they appear?
This paper identifies and disentangles two distinct overfitting phenomena in neural networks: a linear peak at N=D, caused by fitting label noise as in linear regression, and a nonlinear peak at N=P, caused by the output's sensitivity to label noise and to weight initialization. Using random feature models and neural networks, it shows that the two peaks coexist in noisy regression, that nonlinearity suppresses the linear peak while amplifying the nonlinear one, and that only the nonlinear peak is substantially mitigated by regularization or ensembling.
A recent line of research has highlighted the existence of a "double descent" phenomenon in deep learning, whereby increasing the number of training examples $N$ causes the generalization error of neural networks to peak when $N$ is of the same order as the number of parameters $P$. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when $N$ is equal to the input dimension $D$. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at $N\!=\!P$ is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in neural networks). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at $N\!=\!D$ is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep neural networks.
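The sample-wise behaviour described in the abstract can be probed numerically with a small random feature ridge regression. The sketch below is not the paper's code: the linear teacher, the Gaussian inputs, the tanh activation, the ridge parameterization, and the names `rf_test_error` and `sweep` are all assumptions chosen for illustration.

```python
import numpy as np

def rf_test_error(N, P, D, noise_std, ridge, activation=np.tanh,
                  n_test=2000, seed=0):
    """Test MSE of ridge regression on P random features, for one random draw."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D)          # assumed linear teacher
    theta = rng.standard_normal((P, D))    # fixed random first-layer weights

    def features(X):
        return activation(X @ theta.T / np.sqrt(D))

    X_train = rng.standard_normal((N, D))
    X_test = rng.standard_normal((n_test, D))
    # Training labels are corrupted by Gaussian noise; test labels are clean.
    y_train = X_train @ beta / np.sqrt(D) + noise_std * rng.standard_normal(N)
    y_test = X_test @ beta / np.sqrt(D)

    Z = features(X_train)
    # Ridge solution for the second-layer weights: (Z^T Z + ridge I) a = Z^T y.
    a = np.linalg.solve(Z.T @ Z + ridge * np.eye(P), Z.T @ y_train)
    return float(np.mean((features(X_test) @ a - y_test) ** 2))

def sweep(Ns, P, D, noise_std, ridge, n_seeds=5):
    """Average the test error over a few seeds for each training set size N."""
    return np.array([
        np.mean([rf_test_error(N, P, D, noise_std, ridge, seed=s)
                 for s in range(n_seeds)])
        for N in Ns
    ])
```

Calling `sweep(range(5, 200, 5), P=50, D=20, noise_std=0.5, ridge=1e-6)` and plotting the result against N is one way to look for peaks near N=D and N=P. Which peaks are visible will depend on the noise level, the activation, and the ridge strength chosen.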
Motivation & Objective
- To distinguish between two types of overfitting in neural networks: one tied to input dimension D and another to model parameters P.
- To investigate whether both overfitting peaks—linear (at N=D) and nonlinear (at N=P)—can coexist in the same model.
- To understand how the degree of nonlinearity in activation functions affects the prominence of each peak.
- To examine the impact of regularization and ensembling on each peak, and determine whether they affect both types of overfitting equally.
- To analyze the temporal dynamics of peak formation during training, particularly the order in which peaks emerge.
Proposed method
- Analyzes the test loss in random feature models with varying activation functions to isolate the effects of nonlinearity on overfitting.
- Performs bias-variance decomposition of the test loss to attribute the linear peak to noise fitting and the nonlinear peak to initialization variance.
- Uses ridge regression in the random feature model to study the eigenspectrum of the Gram matrix analytically and to relate the peaks to its small eigenvalues.
- Employs numerical experiments on fully connected neural networks with ReLU, Tanh, and linear activations to validate theoretical findings.
- Applies regularization (weight decay) and ensembling (averaging over multiple random seeds) to assess their differential effects on the two peaks.
- Traces the evolution of test loss during training to compare the timing of peak formation, linking it to eigenmode learning speed.
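The ensembling and bias-variance steps listed above can be illustrated with a simplified decomposition that isolates only the variance due to the random feature draw, with the training data and label noise held fixed. This is a minimal sketch under assumed settings (linear teacher, tanh features, hypothetical function names), not the paper's full decomposition. It relies on an exact identity: the average error of the individual models equals the error of the averaged predictor plus the variance of the predictions across draws.

```python
import numpy as np

def fit_predict(theta, X_train, y_train, X_test, ridge):
    """Ridge regression on tanh random features defined by theta."""
    P, D = theta.shape
    Z = np.tanh(X_train @ theta.T / np.sqrt(D))
    Z_test = np.tanh(X_test @ theta.T / np.sqrt(D))
    a = np.linalg.solve(Z.T @ Z + ridge * np.eye(P), Z.T @ y_train)
    return Z_test @ a

def ensemble_decomposition(N=50, P=50, D=20, K=10, noise_std=0.5,
                           ridge=1e-6, n_test=1000, seed=0):
    """Return (mean single-model error, ensemble error, initialization variance)."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D)          # assumed linear teacher
    X_train = rng.standard_normal((N, D))
    X_test = rng.standard_normal((n_test, D))
    y_train = X_train @ beta / np.sqrt(D) + noise_std * rng.standard_normal(N)
    y_test = X_test @ beta / np.sqrt(D)

    # K models share the same data and differ only in their random features.
    preds = np.stack([
        fit_predict(rng.standard_normal((P, D)), X_train, y_train, X_test, ridge)
        for _ in range(K)
    ])                                      # shape (K, n_test)

    mean_single = float(np.mean((preds - y_test) ** 2))
    ensemble = float(np.mean((preds.mean(axis=0) - y_test) ** 2))
    init_variance = float(np.mean(preds.var(axis=0)))
    return mean_single, ensemble, init_variance
```

Because the identity is exact, the ensemble error can never exceed the mean single-model error. Comparing the three returned values at N=P and at N=D gives a rough sense of how much of each peak averaging over initializations can remove.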
Experimental results
Research questions
- RQ1: Are the linear peak at N=D and the nonlinear peak at N=P two distinct overfitting phenomena?
- RQ2: Can both peaks coexist in the same model, and if so, under what conditions?
- RQ3: How does the nonlinearity of the activation function influence the relative strength of each peak?
- RQ4: Can regularization or ensembling suppress both peaks equally, or only one?
- RQ5: Do the two peaks form at different times during training, and if so, why?
Key findings
- The linear peak at N=D is caused solely by overfitting label noise and vanishes in the noiseless regime, confirming its origin in linear regression-like behavior.
- The nonlinear peak at N=P arises from the extreme sensitivity of the output to both label noise and the random feature initialization, and it persists even without label noise.
- Increasing nonlinearity (e.g., from linear to ReLU or Tanh) weakens the linear peak due to implicit regularization and strengthens the nonlinear peak by increasing initialization variance.
- Regularization and ensembling effectively suppress the nonlinear peak but only weakly affect the linear peak, which is already implicitly regularized by nonlinearity.
- The nonlinear peak forms later during training than the linear peak because it depends on learning small eigenmodes of the Gram matrix, which are slow to converge.
- In the (P, N) phase space, both peaks can coexist, leading to a sample-wise triple descent curve, especially at high noise levels.
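The timing argument in the findings above rests on a standard property of gradient descent on a least-squares loss: the error along each eigenvector of the feature Gram matrix shrinks by a factor (1 - lr * eigenvalue) per step, so directions with small eigenvalues are learned last. The sketch below checks this on tanh random features. The sizes, the learning rate, the arbitrary targets, and the name `mode_decay` are assumptions for illustration only.

```python
import numpy as np

def mode_decay(N=200, P=20, D=10, steps=20, seed=0):
    """Run gradient descent on (1/2N)||Z a - y||^2 and track each eigenmode."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((P, D))
    X = rng.standard_normal((N, D))
    Z = np.tanh(X @ theta.T / np.sqrt(D))
    y = rng.standard_normal(N)             # arbitrary targets; only dynamics matter

    H = Z.T @ Z / N                         # Gram matrix of the features (P x P)
    eigvals, eigvecs = np.linalg.eigh(H)    # eigenvalues in ascending order
    a_star = np.linalg.lstsq(Z, y, rcond=None)[0]   # least-squares minimizer
    lr = 0.5 / eigvals[-1]                  # step size that keeps every mode stable

    a = np.zeros(P)
    for _ in range(steps):
        a -= lr * (Z.T @ (Z @ a - y)) / N

    initial = eigvecs.T @ (np.zeros(P) - a_star)    # error per mode at step 0
    remaining = eigvecs.T @ (a - a_star)            # error per mode after `steps`
    predicted_factor = (1.0 - lr * eigvals) ** steps
    return eigvals, initial, remaining, predicted_factor
```

The first entry of `predicted_factor` corresponds to the smallest eigenvalue and stays closest to 1, meaning that mode has barely moved while the leading modes have essentially converged.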
This review was created by AI and reviewed by human editors.