QUICK REVIEW

[Paper Review] Gradient Descent Finds Global Minima of Deep Neural Networks

Simon S. Du, Jason D. Lee | arXiv (Cornell University) | Nov 9, 2018

Sparse and Compressive Sensing Techniques | 47 references | 198 citations

TL;DR

The paper proves that gradient descent can achieve zero training loss in polynomial time for over-parameterized deep neural networks with residual connections (ResNet) and extends to convolutional ResNets, by analyzing the stability of Gram matrices during training.

ABSTRACT

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.

Motivation & Objective

Understand why randomly initialized gradient methods achieve zero training loss in deep networks.
Establish conditions under which gradient descent converges to a global minimum for deep fully-connected, ResNet, and convolutional ResNet architectures.
Identify assumptions on the activation function and architecture that enable a rigorous stability analysis of the training dynamics.

Proposed method

Define a Gram matrix framework that captures the training dynamics of deep networks.
Show that with sufficient width, the Gram matrix at initialization is close to a data- and architecture-dependent limit and remains stable during training.
Use a power-method style argument to relate convergence rate to the least eigenvalue of the limiting Gram matrix.
Derive architecture-specific recursive definitions of Gram matrices for fully-connected, ResNet, and convolutional ResNet to bound perturbations across layers.
Demonstrate that perturbation propagation is milder in ResNet due to skip connections, reducing exponential dependence on depth.
Provide convergence theorems showing linear convergence rates for gradient descent under appropriate step sizes and over-parameterization.
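
The Gram-matrix argument in the steps above can be sketched as a short derivation. The notation is an assumption made for this sketch: u(k) is the vector of network predictions on the n training points at iteration k, y is the label vector, η is the step size, G(k) is the Gram matrix at iteration k, and λ_0 is the least eigenvalue of the limiting Gram matrix. The first line is an approximation that drops higher-order terms, and the step size is assumed small enough that η times the largest eigenvalue of G(k) is at most 1.

```latex
% Sketch only: higher-order terms in the prediction dynamics are dropped.
\begin{gather*}
  y - u(k+1) \approx \bigl(I - \eta\, G(k)\bigr)\bigl(y - u(k)\bigr), \\
  \lambda_{\min}\bigl(G(k)\bigr) \ge \tfrac{\lambda_0}{2}
  \;\Longrightarrow\;
  \|y - u(k+1)\|_2^2 \le \Bigl(1 - \tfrac{\eta \lambda_0}{2}\Bigr)\,\|y - u(k)\|_2^2, \\
  \|y - u(k)\|_2^2 \le \Bigl(1 - \tfrac{\eta \lambda_0}{2}\Bigr)^{k}\,\|y - u(0)\|_2^2 .
\end{gather*}
```

The middle line is where stability matters: if the Gram matrix stays close to its limit during training, its least eigenvalue stays bounded away from zero, and the residual contracts by a fixed factor at every step.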

Experimental results

Research questions

RQ1: Can gradient descent achieve zero training loss on deep, over-parameterized networks with residual connections?
RQ2: How do network width and architecture (fully-connected vs. ResNet vs. convolutional ResNet) affect the required over-parameterization and convergence rate?
RQ3: What role do Gram matrices play in guaranteeing global convergence, and how stable are they during training?
RQ4: What activation function and data assumptions are needed to ensure positive definiteness of the Gram matrices and thus convergence?

Key findings

For deep fully-connected networks, sufficient width m ensures gradient descent converges to zero training loss at a linear rate (under specified initialization and data assumptions).
For ResNet architectures, the required width per layer grows more slowly with depth than in fully-connected nets, yielding polynomial depth dependence in convergence guarantees.
For convolutional ResNet, convergence to zero training loss holds with width polynomial in the number of training samples, the number of patches, and the depth.
The analysis shows that the Gram matrices $G^{(H)}(k)$ stay close to a data- and architecture-dependent limit $K^{(H)}$, and a strictly positive minimum eigenvalue of $K^{(H)}$ guarantees linear convergence.
Skip connections in ResNet stabilize perturbations, avoiding exponential depth dependence in width requirements and enabling polynomial dependency on depth.
The results hold for Lipschitz, smooth activations that are analytic and not polynomial (e.g., softplus), under random Gaussian initialization and quadratic loss.
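
The findings above can be illustrated numerically on a much simpler model than the paper's. The following is a minimal sketch, assuming a two-layer softplus network in which only the Gaussian-initialized first layer is trained and the output weights are fixed random signs. It is not the deep ResNet setting analyzed in the paper, and all function names and parameter values (`run_gd`, `gram`, `m=4000`, `eta=0.5`, and so on) are hypothetical choices made for this example. The sketch reports the least eigenvalue of the Gram matrix at initialization, the training loss before and after gradient descent, and how far the Gram matrix drifts during training.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def sigmoid(z):
    # Derivative of softplus.
    return 1.0 / (1.0 + np.exp(-z))


def predict(W, a, X):
    # u_i = (1 / sqrt(m)) * sum_r a_r * softplus(w_r . x_i)
    m = W.shape[0]
    return softplus(X @ W.T) @ a / np.sqrt(m)


def gram(W, a, X):
    # G_ij = <du_i/dW, du_j/dW>
    #      = (x_i . x_j) * (1/m) * sum_r a_r^2 * s(w_r . x_i) * s(w_r . x_j),
    # where s is the sigmoid (the derivative of softplus).
    m = W.shape[0]
    D = sigmoid(X @ W.T) * a  # shape (n, m)
    return (X @ X.T) * (D @ D.T) / m


def run_gd(n=6, d=20, m=4000, eta=0.5, steps=500, seed=0):
    # Assumed toy setup: random unit-norm inputs and Gaussian labels.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))      # trained first layer, Gaussian init
    a = rng.choice([-1.0, 1.0], size=m)  # fixed output weights
    G0 = gram(W, a, X)
    losses = []
    for _ in range(steps):
        r = predict(W, a, X) - y
        losses.append(0.5 * float(r @ r))  # quadratic loss
        D = sigmoid(X @ W.T) * a
        # Gradient of the quadratic loss with respect to W, shape (m, d).
        W -= eta * ((D * r[:, None]).T @ X) / np.sqrt(m)
    r = predict(W, a, X) - y
    losses.append(0.5 * float(r @ r))
    Gk = gram(W, a, X)
    # Relative Frobenius-norm change of the Gram matrix over training.
    drift = float(np.linalg.norm(Gk - G0) / np.linalg.norm(G0))
    lam_min = float(np.linalg.eigvalsh(G0)[0])
    return losses, lam_min, drift


if __name__ == "__main__":
    losses, lam_min, drift = run_gd()
    print(f"lambda_min(G(0)) = {lam_min:.4f}")
    print(f"loss: {losses[0]:.3e} -> {losses[-1]:.3e}")
    print(f"relative Gram drift = {drift:.3e}")
```

In this toy setting the width is far larger than the number of samples, so each weight vector moves only slightly and the Gram matrix changes little. Reducing `m` should make the drift larger, which is the qualitative effect the over-parameterization requirement is meant to control.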

This review was created by AI and reviewed by human editors.