[Paper Review] Towards moderate overparameterization: global convergence guarantees for training shallow neural networks
The paper proves that gradient descent (and SGD) on one-hidden-layer neural networks with smooth or ReLU activations converges at a fast geometric rate to a global optimum that perfectly interpolates the training data once the square root of the number of parameters exceeds the training set size, specifically kd ≳ n^2 in the smooth case and sqrt(kd) ≳ n^2/d in the ReLU case.
Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient, and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceeds the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent when initialized at random converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
Motivation & Objective
- Quantify the level of overparameterization that first-order methods need in order to converge globally when training shallow networks.
- Show that gradient descent initialized at random converges geometrically to a global optimum that interpolates all training data.
- Extend results to ReLU activations and to SGD, providing convergence guarantees and rates.
- Bridge the theory-practice gap by demonstrating that moderate overparameterization suffices, not only extremely wide networks.
Proposed method
- Analyze one-hidden-layer network f(x;W)=v^T phi(Wx) with fixed v and trained W under a quadratic loss.
- Derive gradient descent and SGD update rules and establish conditions on kd relative to n and data properties.
- Use spectral properties of Khatri-Rao and Hadamard products, along with random matrix theory, to bound Jacobian spectra at initialization.
- Prove geometric convergence rates: ||f(W_τ)-y||_2 decays as (1 - c μ^2/B^2 …)^τ with high probability.
- Provide corollaries for standard data models (e.g., random data on the unit sphere) to illustrate kd ≳ n^2 scaling.
- Extend results to ReLU activations with adjusted overparameterization requirements and similar convergence statements.
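As a rough illustration of the Jacobian structure mentioned in the list above, the following NumPy sketch builds the Jacobian of the n network outputs with respect to W and checks numerically that its Gram matrix equals a Hadamard product of two smaller Gram matrices. The choice of tanh as the smooth activation, the random-sign scaling of v, the problem sizes, and the helper names `jacobian` and `gram_via_hadamard` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def jacobian(W, X, v, dphi):
    """Jacobian of the outputs f(x_i; W) = v^T phi(W x_i) w.r.t. vec(W).

    Row i is the flattened k-by-d matrix (v * phi'(W x_i)) x_i^T,
    so the Jacobian is a row-wise Kronecker (Khatri-Rao type) product.
    """
    A = dphi(X @ W.T) * v                      # (n, k): phi'(<w_l, x_i>) * v_l
    n, k = A.shape
    d = X.shape[1]
    return (A[:, :, None] * X[:, None, :]).reshape(n, k * d)

def gram_via_hadamard(W, X, v, dphi):
    """Same Gram matrix J J^T, computed as an entrywise (Hadamard) product."""
    A = dphi(X @ W.T) * v
    return (A @ A.T) * (X @ X.T)

# Hypothetical sizes and data, chosen only for illustration.
rng = np.random.default_rng(0)
n, d, k = 6, 4, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # points on the unit sphere
W = rng.standard_normal((k, d))                    # random initialization
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # fixed output layer (assumed scaling)

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2

J = jacobian(W, X, v, dtanh)
max_gap = np.max(np.abs(J @ J.T - gram_via_hadamard(W, X, v, dtanh)))
sigma_min = np.linalg.svd(J, compute_uv=False).min()
print("max entrywise gap:", max_gap)
print("smallest singular value of J at initialization:", sigma_min)
```

The smallest singular value printed at the end is the kind of quantity the analysis needs to bound away from zero at initialization; this sketch only evaluates it for one random draw.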
Experimental results
Research questions
- RQ1: What minimum overparameterization is required for gradient-based methods to achieve zero training error in shallow nets?
- RQ2: Do random initializations and first-order methods converge to global optima when kd exceeds data size by a constant factor or more?
- RQ3: How do smooth activations and ReLU activations compare in required overparameterization and convergence rates?
- RQ4: Do SGD updates inherit the global convergence guarantees observed for full-batch gradient descent?
- RQ5: What do these results imply about the practical gap between theory and practice in moderately overparameterized regimes?
Key findings
- Gradient descent on one-hidden-layer networks with smooth activations converges geometrically to a global optimum that perfectly fits the training data as soon as sqrt(kd) ≥ c (B^2/μ_φ^2) (1+δ) κ(X) n.
- For ReLU activations, a similar guarantee holds with sqrt(kd) ≥ C (1+δ) n^2/d κ^3(X) σ_min^2(X*X).
- Corollaries show that the scaling kd ≳ n^2 typically suffices in random data settings; when n ≲ d, the bound simplifies to k ≳ n, and the convergence rate no longer depends on the dimensions.
- SGD with random initialization also achieves fast convergence to a near-global optimum, staying close to initialization with high probability and with a rate comparable to GD under suitable parameters.
- Numerical experiments show phase transitions in which the success probability changes sharply near the kd = n boundary, suggesting that the overparameterization needed in practice may be close to this threshold.
- The work connects kernel-like random feature intuition (k ≲ n) with deeper optimization guarantees in moderately overparameterized regimes.
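The convergence behaviour described in the findings above can be explored with a small experiment. This is a minimal sketch, assuming a tanh activation, a fixed random-sign output layer, unit-norm random inputs, Gaussian labels, and a conservative step size of 0.5 / ||X||^2; none of these constants come from the paper, and the function name `train_gd` is hypothetical.

```python
import numpy as np

def train_gd(X, y, k, steps=1000, seed=0):
    """Full-batch gradient descent on W for f(x; W) = v^T tanh(W x), v fixed.

    Minimizes 0.5 * ||f(W) - y||^2 and records the residual norm per step.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((k, d))                    # random initialization
    v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # fixed output weights (assumed)
    lr = 0.5 / np.linalg.norm(X, 2) ** 2               # assumed conservative step size
    residual_norms = []
    for _ in range(steps):
        Z = X @ W.T                                    # (n, k) pre-activations
        r = np.tanh(Z) @ v - y                         # (n,) residual f(W) - y
        residual_norms.append(np.linalg.norm(r))
        # Gradient w.r.t. W: sum_i r_i * (v * phi'(W x_i)) x_i^T, shape (k, d)
        G = ((1.0 - np.tanh(Z) ** 2) * v * r[:, None]).T @ X
        W -= lr * G
    return W, np.array(residual_norms)

# Hypothetical sizes with kd well above n^2 (kd = 2000, n^2 = 25).
rng = np.random.default_rng(1)
n, d, k = 5, 20, 100
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # inputs on the unit sphere
y = rng.standard_normal(n)                             # arbitrary labels

W, residuals = train_gd(X, y, k)
print("initial residual norm:", residuals[0])
print("final residual norm:  ", residuals[-1])
```

Plotting `residuals` on a log scale is one way to see whether the decay looks geometric for a given choice of n, d, and k; shrinking k toward the kd ≈ n regime makes it possible to probe where fitting starts to fail.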
This review was created by AI and reviewed by human editors.