QUICK REVIEW

[Paper Review] Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S. Du, Xiyu Zhai|arXiv (Cornell University)|Oct 4, 2018

Stochastic Gradient Optimization Techniques|29 references|418 citations

TL;DR

The paper proves that gradient descent with random initialization globally minimizes training loss for two-layer ReLU networks when over-parameterized, achieving linear convergence under mild assumptions.

ABSTRACT

One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

Motivation & Objective

Demystify why randomly initialized first-order methods find global minima for over-parameterized ReLU networks.
Provide a rigorous convergence analysis for gradient descent on a two-layer network under non-convex, non-smooth objectives.
Show that over-parameterization and random initialization keep weights close to initialization, enabling a convex-like analysis.
Extend insights toward analyzing deeper models and other first-order methods.

Proposed method

Model: two-layer fully connected ReLU network with f(W,a,x) = (1/√m) sum_{r=1}^m a_r σ(w_r^T x), where σ is the ReLU activation and m is the number of hidden nodes.
Optimize the first layer with gradient descent while keeping the second layer fixed, then extend to joint training.
Introduce Gram matrix H(t) with entries H_ij(t) = (1/m) x_i^T x_j sum_r I{w_r^T x_i ≥ 0, w_r^T x_j ≥ 0}.
Show that predictions u_i(t) = f(W(t),a,x_i) evolve as du/dt = H(t)(y−u), linking convergence to the spectrum of H∞ (the expectation of the Gram matrix H(0) over the random initialization of the weights).
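Prove that with m large enough (and no two inputs parallel), λ_min(H(0)) ≥ (3/4)λ0, where λ0 = λ_min(H∞) > 0, and that ∥H(t)−H(0)∥2 ≤ λ0/4 as long as every w_r stays close to its initialization, so that λ_min(H(t)) ≥ λ0/2 throughout training.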
Provide discrete-time gradient descent results with step size η = O(λ0/n^2) yielding linear convergence.
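
The setup listed above can be sketched numerically. The following is a minimal NumPy illustration, not the authors' code: the function names, the problem sizes (n, d, m), the random labels, and the step size eta = 0.1 are all assumptions chosen for a quick run, and the step size is much larger than the η = O(λ0/n^2) the analysis covers. It builds the network, forms the Gram matrix H(0), and runs gradient descent on the first layer with the second layer fixed.

```python
import numpy as np

def init_network(m, d, rng):
    """Random initialization: w_r ~ N(0, I), a_r uniform on {-1, +1}."""
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def predict(W, a, X):
    """f(W, a, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x) for each row x of X."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

def gram_matrix(W, X):
    """H_ij = (1/m) * x_i^T x_j * #{r : w_r^T x_i >= 0 and w_r^T x_j >= 0}."""
    m = W.shape[0]
    act = (X @ W.T >= 0).astype(float)   # (n, m) activation pattern
    return (X @ X.T) * (act @ act.T) / m

def train_first_layer(W, a, X, y, eta, steps):
    """Gradient descent on 0.5 * sum_i (f(x_i) - y_i)^2, keeping a fixed."""
    m = W.shape[0]
    W = W.copy()
    losses = []
    for _ in range(steps):
        pre = X @ W.T                                        # (n, m)
        resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y    # (n,)
        losses.append(0.5 * float(resid @ resid))
        # dL/dw_r = (1/sqrt(m)) * a_r * sum_i resid_i * 1{w_r^T x_i >= 0} * x_i
        grad = ((pre >= 0) * resid[:, None]).T @ X * a[:, None] / np.sqrt(m)
        W -= eta * grad
    return W, losses

# Illustrative problem sizes (assumptions, not values from the paper).
rng = np.random.default_rng(0)
n, d, m = 10, 5, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
y = rng.uniform(-1.0, 1.0, size=n)              # arbitrary labels

W0, a = init_network(m, d, rng)
H0 = gram_matrix(W0, X)
lam_min = np.linalg.eigvalsh(H0)[0]             # least eigenvalue of H(0)

W, losses = train_first_layer(W0, a, X, y, eta=0.1, steps=200)
max_move = np.linalg.norm(W - W0, axis=1).max() # largest ||w_r - w_r(0)||

print("lambda_min(H(0)):", lam_min)
print("loss at first and last step:", losses[0], losses[-1])
print("max weight movement:", max_move)
```

Rerunning with a larger m and comparing `max_move` against the typical norm of the rows of `W0` is one way to observe the "weights stay close to initialization" behaviour the analysis relies on.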

Experimental results

Research questions

RQ1: Under what conditions does gradient descent converge to zero training loss for a two-layer ReLU network?
RQ2: How do over-parameterization and random initialization influence the dynamics of the learning process?
RQ3: Can the training dynamics be characterized by a stable Gram matrix, enabling a convex-like convergence analysis?
RQ4: Does the analysis extend to jointly training both layers or only the first layer?
RQ5: What is the convergence rate and required width m to guarantee linear convergence?

Key findings

Gradient descent converges to zero training loss at a linear rate, with probability at least 1−δ over the random initialization, when m = Ω(n^6/(λ0^4 δ^3)) and no two inputs are parallel.
The dynamics of predictions are governed by a time-varying Gram matrix H(t), which under over-parameterization remains close to its initial value H(0), and hence to H∞.
With high probability, the least eigenvalue of H(0) is positive if no two inputs are parallel, enabling linear convergence.
For gradient flow, the distance from initialization remains bounded (weights stay close to initial values) during training.
Joint training of both layers yields the same linear convergence under similar over-parameterization requirements.
Discrete-time gradient descent with a constant step size η = O(λ0/n^2) achieves the same linear convergence rate.
The analysis relies on standard concentration bounds and perturbation theory, without requiring Gaussian inputs or label generation assumptions.
The framework suggests potential generalization to deeper networks and other first-order methods.
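
The step from a well-conditioned Gram matrix to a linear rate can be written out for gradient flow. This is a short derivation under one stated assumption: that λ_min(H(t)) ≥ λ0/2 holds for all t, which is what the perturbation argument is meant to guarantee. It uses only the dynamics du/dt = H(t)(y−u) given above.

```latex
\begin{align*}
\frac{d}{dt}\,\lVert y - u(t)\rVert_2^2
  &= -2\,(y - u(t))^\top \frac{du(t)}{dt} \\
  &= -2\,(y - u(t))^\top H(t)\,(y - u(t)) \\
  &\le -2\,\lambda_{\min}(H(t))\,\lVert y - u(t)\rVert_2^2 \\
  &\le -\lambda_0\,\lVert y - u(t)\rVert_2^2 ,
\end{align*}
so that, by Gr\"onwall's inequality,
\[
\lVert y - u(t)\rVert_2^2 \;\le\; e^{-\lambda_0 t}\,\lVert y - u(0)\rVert_2^2 .
\]
```

The training loss therefore decays exponentially in t, which is the continuous-time form of the linear convergence rate listed in the findings.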

This review was created by AI and reviewed by human editors.