QUICK REVIEW

[Paper Review] Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

Simon S. Du, Jason D. Lee|arXiv (Cornell University)|Dec 3, 2017
Adversarial Robustness in Machine Learning | 101 citations
TL;DR

Gradient descent with weight normalization can learn a two-layer CNN with non-overlapping patches under Gaussian inputs, despite the presence of a spurious local minimum; multiple random restarts can boost success to high probability.

ABSTRACT

We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs from a teacher network of the same architecture with fixed weights $(\mathbf{w}^*, \mathbf{a}^*)$, we prove that with Gaussian input $\mathbf{Z}$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability $1$ with multiple restarts. We also show that with constant probability, the same procedure could also converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.

Motivation & Objective

  • Understand the learning dynamics of gradient descent for a two-layer CNN with a non-overlapping convolutional layer.
  • Characterize the optimization landscape, including the existence of spurious local minima.
  • Show that randomly initialized gradient descent can recover true parameters under Gaussian inputs.
  • Provide conditions under which convergence is guaranteed and quantify convergence phases.

Proposed method

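  • Model the network as $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_j\sigma(\mathbf{w}^T\mathbf{Z}_j)$ with non-overlapping patches $\mathbf{Z}_j$ and ReLU activation $\sigma$.
  • Reparameterize the first layer with weight normalization, $\mathbf{w} = \mathbf{v} / \|\mathbf{v}\|_2$, and analyze the loss $\ell(\mathbf{v}, \mathbf{a})$.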
  • Derive the population loss and gradient expressions under Gaussian Z (Theorems 3.1 and 3.2).
  • Prove a two-phase convergence of gradient descent with initialization-based guarantees (Theorems 4.1 and 4.2).
  • Demonstrate the existence of a spurious local minimum and show that certain initializations lead to convergence to it (Theorem 4.3).
  • Provide a random initialization scheme under which gradient descent reaches the global minimum with constant probability, and discuss how restarts boost the overall success probability.
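
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' code: it minimizes an empirical squared loss over a finite Gaussian sample rather than the population loss the paper analyzes, and the function names (`forward`, `loss_and_grads`, `train`, `make_teacher_data`), the step size, the sample size, and the initialization scales are all assumptions chosen for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(Z, w, a):
    """f(Z, w, a) = sum_j a_j * relu(w^T Z_j) for a batch of inputs.

    Z: (n, p, k) array, n inputs with k non-overlapping patches of size p.
    w: (p,) shared filter; a: (k,) output weights.
    """
    return relu(np.einsum("p,npk->nk", w, Z)) @ a


def make_teacher_data(n, p, k, rng):
    """Gaussian inputs labeled by a teacher (w_star, a_star); sizes are illustrative."""
    w_star = rng.standard_normal(p)
    w_star /= np.linalg.norm(w_star)
    a_star = rng.standard_normal(k)
    Z = rng.standard_normal((n, p, k))
    return Z, forward(Z, w_star, a_star), w_star, a_star


def loss_and_grads(Z, y, v, a):
    """Empirical squared loss and its gradients under w = v / ||v||."""
    n = Z.shape[0]
    norm = np.linalg.norm(v)
    w = v / norm
    pre = np.einsum("p,npk->nk", w, Z)      # pre-activations, shape (n, k)
    h = relu(pre)
    resid = h @ a - y                       # prediction error, shape (n,)
    loss = 0.5 * np.mean(resid ** 2)
    grad_a = h.T @ resid / n
    gate = (pre > 0) * a                    # ReLU derivative times output weight
    grad_w = np.einsum("n,nk,npk->p", resid, gate, Z) / n
    # Chain rule through the normalization: dw/dv = (I - w w^T) / ||v||.
    grad_v = (grad_w - w * (w @ grad_w)) / norm
    return loss, grad_v, grad_a


def train(Z, y, v0, a0, lr=0.05, steps=500):
    """Plain gradient descent on (v, a); returns the final parameters and loss."""
    v, a = v0.copy(), a0.copy()
    for _ in range(steps):
        _, grad_v, grad_a = loss_and_grads(Z, y, v, a)
        v -= lr * grad_v
        a -= lr * grad_a
    return v, a, loss_and_grads(Z, y, v, a)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z, y, w_star, a_star = make_teacher_data(2000, 4, 3, rng)
    v0 = rng.standard_normal(4)
    a0 = 0.1 * rng.standard_normal(3)       # small output weights: an assumption
    v, a, final_loss = train(Z, y, v0, a0)
    print("initial loss:", loss_and_grads(Z, y, v0, a0)[0])
    print("final loss:  ", final_loss)
    print("cos(angle to w*):", (v / np.linalg.norm(v)) @ w_star)
```

Because the gradient with respect to v is orthogonal to v, each update changes the direction of w without shrinking ||v||. Depending on the random initialization, a run of this sketch may end near the teacher parameters or at a point with nonzero loss, which is consistent with the constant-probability statements in the abstract.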

Experimental results

Research questions

  • RQ1: Can randomly initialized gradient descent learn the true weights of a one-hidden-layer CNN with Gaussian inputs?
  • RQ2: Does the objective have spurious local minima, and can gradient descent still reach the global minimum?
  • RQ3: How do initialization and the two-phase dynamics affect convergence speed and success probability?

Key findings

  • There exist initialization regimes under which gradient descent converges to the teacher parameters with constant probability; this can be boosted to probability 1 with multiple restarts.
  • The objective has a spurious local minimum, and under the same random initialization scheme gradient descent also converges to it with constant probability.
  • The optimization dynamics exhibit two phases: a slow initial phase followed by a faster linear-rate phase after sufficient progress.
  • The analysis provides explicit forms of the population loss and its gradient, which depend on the angle between $\mathbf{w}$ and $\mathbf{w}^*$ and on the inner product $\mathbf{a}^T\mathbf{a}^*$.
  • With Gaussian inputs, the results imply a polynomial-time convergence guarantee for randomly initialized gradient descent, given appropriate restarts.
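
The restart argument in the findings above is simple arithmetic: if each independently initialized run reaches the teacher parameters with probability at least p, then all R runs fail with probability at most (1 - p)^R. The sketch below computes that bound and the number of restarts needed for a target failure probability. The helper names and the value of p used in the example are hypothetical; this review does not state the paper's constant.

```python
import math


def failure_probability(p_success, restarts):
    """Upper bound on the chance that every one of `restarts` independent runs fails."""
    return (1.0 - p_success) ** restarts


def restarts_needed(p_success, delta):
    """Smallest R such that (1 - p_success) ** R <= delta."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if p_success == 1.0:
        return 1
    return math.ceil(math.log(delta) / math.log(1.0 - p_success))


if __name__ == "__main__":
    p = 0.25  # hypothetical per-run success probability, not taken from the paper
    for delta in (0.1, 0.01, 0.001):
        print(delta, restarts_needed(p, delta))
```

The required number of restarts grows only logarithmically in 1/delta, which is why a constant per-run success probability is enough for a polynomial-time guarantee. In the teacher-student setting a successful run can be recognized by its near-zero loss, so the best of the R runs can be selected.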

This review was created by AI and reviewed by human editors.