[Paper Review] Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks
The paper proves that gradient descent with early stopping yields robustness to label noise in overparameterized one-hidden-layer networks under a clusterable data model, by showing the final model stays close to initialization and ignores corrupted labels until overfitting would require large movement.
Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels, including pure noise. Despite this, somewhat paradoxically, neural network models trained via first-order methods continue to predict well on yet unseen test data. This paper takes a step towards demystifying this phenomenon. Under a rich dataset model, we show that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization. In particular, we prove that: (i) in the first few iterations, where the updates are still in the vicinity of the initialization, gradient descent only fits the correct labels, essentially ignoring the noisy labels; (ii) to start to overfit to the noisy labels, the network must stray rather far from the initialization, which can only occur after many more iterations. Together, these results show that gradient descent with early stopping is provably robust to label noise and shed light on the empirical robustness of deep networks as well as commonly adopted heuristics to prevent overfitting.
Motivation & Objective
- Motivate and analyze why overparameterized neural networks trained with first-order methods generalize well in the presence of label noise.
- Develop a theoretical framework showing robustness of gradient descent with early stopping to a constant fraction of corrupted labels.
- Characterize how distance from initialization governs robustness versus overfitting.
- Provide conditions under which early stopping prevents overfitting and enables correct label recovery.
Proposed method
- Model: one-hidden-layer neural network with k hidden units and fixed output weights, trained by gradient descent on a squared loss.
- Data: clusterable dataset with K clusters, up to K̄ ≤ K classes, and noisy/corrupted labels defined by a corruption fraction ρ per cluster.
- Key tool: the neural net covariance Σ(C), built from the cluster centers C and the activation derivative, whose minimum eigenvalue λ(C) indicates class separability.
- Prove that gradient descent with step size η = c · K / (n ‖C‖²), where c is a constant and n is the number of training samples, reaches a solution within a neighborhood of the initialization after T iterations, and that this solution correctly predicts the true labels for inputs near the cluster centers.
- Show that the residual decomposes into a clean residual aligned with the large singular subspace and a noise residual in the small singular subspace, which leads to robustness under early stopping.
- Demonstrate that to overfit the noisy labels one must travel far from initialization, linking robustness to distance from initialization.
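The setup in the list above can be illustrated with a small NumPy sketch. It is not the paper's code: the tanh activation, the ±1/√k output weights, the random label flips, the constant c = 0.5, and all sizes (n, d, K, k, T) are assumptions chosen only for illustration. It trains the hidden layer for a fixed number of steps T (the early stop) and records the distance from initialization.

```python
import numpy as np

def make_clustered_data(n=200, d=50, K=4, rho=0.1, eps=0.05, seed=0):
    """Unit-norm points within roughly eps of K unit-norm cluster centers."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((K, d))
    C /= np.linalg.norm(C, axis=1, keepdims=True)
    cluster = rng.integers(0, K, size=n)
    X = C[cluster] + eps * rng.standard_normal((n, d)) / np.sqrt(d)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y_true = np.where(cluster % 2 == 0, 1.0, -1.0)  # two classes: +1 / -1
    flip = rng.random(n) < rho                       # corrupt about a rho fraction
    y_noisy = np.where(flip, -y_true, y_true)
    return X, y_true, y_noisy, C

def train_early_stopped(X, y, C, k=500, T=200, c=0.5, seed=1):
    """Gradient descent on 0.5 * ||f(X) - y||^2 over the hidden weights only."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    K = C.shape[0]
    W0 = rng.standard_normal((k, X.shape[1]))        # random initialization
    v = np.where(np.arange(k) % 2 == 0, 1.0, -1.0) / np.sqrt(k)  # fixed output layer
    eta = c * K / (n * np.linalg.norm(C, 2) ** 2)    # step size ~ K / (n ||C||^2)
    W = W0.copy()
    losses, dists = [], []
    for _ in range(T):                               # stopping at T is the early stop
        H = np.tanh(X @ W.T)                         # hidden activations, shape (n, k)
        r = H @ v - y                                # residual on the given labels
        losses.append(0.5 * float(r @ r))
        grad = ((r[:, None] * (1.0 - H ** 2)) * v).T @ X  # dLoss/dW, shape (k, d)
        W -= eta * grad
        dists.append(float(np.linalg.norm(W - W0)))  # distance from initialization
    pred = np.tanh(X @ W.T) @ v
    return pred, losses, dists, W0

X, y_true, y_noisy, C = make_clustered_data()
pred, losses, dists, W0 = train_early_stopped(X, y_noisy, C)
acc_true = float(np.mean(np.sign(pred) == y_true))
acc_noisy = float(np.mean(np.sign(pred) == y_noisy))
print(f"agreement with true labels:  {acc_true:.2f}")
print(f"agreement with noisy labels: {acc_noisy:.2f}")
print(f"distance from init: {dists[-1]:.2f} (||W0|| = {np.linalg.norm(W0):.2f})")
```

Comparing the two agreement rates, and the distance travelled against the norm of the initial weights, gives a rough numerical view of the clean-versus-noise behaviour described above. Increasing T well beyond this point is the regime in which fitting the flipped labels would be expected to begin.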
Experimental results
Research questions
- RQ1: Can gradient descent with early stopping provably learn correct labels in the presence of label noise for overparameterized networks?
- RQ2: How does the data geometry, via the cluster centers C and the minimum eigenvalue λ(C) of the neural net covariance Σ(C), affect robustness to corrupted labels?
- RQ3: What is the role of distance traveled from initialization in preventing overfitting to noisy labels?
- RQ4: How much label corruption can be tolerated while preserving correct prediction on inputs near cluster centers?
Key findings
- Gradient descent with early stopping remains robust to a constant fraction of corrupted labels, achieving correct label prediction for inputs near cluster centers.
- The method requires the final parameters to stay close to initialization; moving far is associated with overfitting to noisy labels.
- Robustness holds with high probability under specified dataset and network conditions, including a bound ρ ≤ δ/8 on the per-cluster corruption fraction.
- The iteration count to achieve robustness is modest, scaling with the data geometry via λ(C) and ||C||, and is typically O(K) up to conditioning.
- Under mild normalization, robustness and final predictive accuracy are independent of the network size, relying instead on the cluster structure and distance from initialization.
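The geometric quantities λ(C) and ‖C‖ that appear in the findings above can be estimated numerically. The sketch below assumes the neural net covariance takes the form Σ(C) = (C Cᵀ) ⊙ E[φ′(Cg) φ′(Cg)ᵀ] with g standard Gaussian and ⊙ the entrywise product. This exact form is not stated in this review and should be checked against the paper. The tanh activation, the sample count, and the function names are likewise illustrative assumptions.

```python
import numpy as np

def tanh_prime(z):
    """Derivative of tanh, used here as an example activation derivative."""
    return 1.0 - np.tanh(z) ** 2

def neural_net_covariance(C, dphi=tanh_prime, num_samples=20000, seed=0):
    """Monte Carlo estimate of (C C^T) * E[dphi(Cg) dphi(Cg)^T], g ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((num_samples, C.shape[1]))  # each row is one draw of g
    D = dphi(G @ C.T)                                   # shape (num_samples, K)
    M = D.T @ D / num_samples                           # estimates E[dphi dphi^T]
    return (C @ C.T) * M                                # entrywise (Hadamard) product

def min_eigenvalue(C, **kwargs):
    """lambda(C): smallest eigenvalue of the estimated covariance."""
    return float(np.linalg.eigvalsh(neural_net_covariance(C, **kwargs))[0])

# Well-separated (orthonormal) centers versus two coinciding centers.
C_sep = np.eye(3)
C_dup = np.array([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

print("lambda(C), separated centers:", min_eigenvalue(C_sep))
print("lambda(C), duplicated center:", min_eigenvalue(C_dup))
print("spectral norm ||C||:", np.linalg.norm(C_sep, 2))
```

Under this assumed definition, duplicated centers drive λ(C) to zero, while orthonormal centers keep it bounded away from zero.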
This review was created by AI and reviewed by human editors.