QUICK REVIEW

[Paper Review] Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs

Alon Brutzkus, Amir Globerson|arXiv (Cornell University)|Feb 26, 2017
Stochastic Gradient Optimization Techniques | 27 references | 78 citations
TL;DR

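The paper shows that learning a one-hidden-layer ReLU convolutional network with non-overlapping filters is NP-complete for general input distributions, but that gradient descent converges to the global optimum in polynomial time when the inputs are Gaussian. It also shows that with overlapping filters, non-global local minima can appear.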

ABSTRACT

Deep learning models are often successfully trained using gradient descent, despite the worst case hardness of the underlying non-convex optimization problem. The key question is then under what conditions can one prove that optimization will succeed. Here we provide a strong result of this kind. We consider a neural net with one hidden layer and a convolutional structure with no overlap and a ReLU activation function. For this architecture we show that learning is NP-complete in the general case, but that when the input distribution is Gaussian, gradient descent converges to the global optimum in polynomial time. To the best of our knowledge, this is the first global optimality guarantee of gradient descent on a convolutional neural network with ReLU activations.

Motivation & Objective

  • Motivate and formalize the learning problem for a one-hidden-layer convolutional network with ReLU activations and no overlap.
  • Show hardness results for general data distributions (NP-complete learning).
  • Establish distribution-dependent tractability: gradient descent converges to the global optimum under Gaussian inputs.
  • Characterize differences between non-overlapping and overlapping filter settings.
  • Provide empirical illustrations of the tractability gap between Gaussian and non-Gaussian inputs.

Proposed method

  • Define the network as f(x;w) = (1/k) sum_i sigma(w · x[i]) with no-overlap structure and average pooling.
  • Express the population risk ell(w) in terms of g(u,v) = E[ sigma(u·x) sigma(v·x) ] under Gaussian inputs, and derive closed-form expressions for g and its gradient (Lemmas 3.1 and 3.2).
  • Specialize to No-Overlap Networks to obtain a simplified loss ell(w) that depends only on ||w||, ||w*||, and the angle theta between w and w* (Eq. 8).
  • Prove NP-hardness of learning No-Overlap Networks in the general distribution setting via a reduction from Set-Splitting-by-k-Sets (Theorem 4.2).
  • Prove convergence of gradient descent to near-global optimum under Gaussian inputs, including a characterization of critical points and a high-probability O(1/epsilon^2) iteration bound (Theorem 5.2).
  • Provide empirical demonstrations of the tractability gap and discuss behavior with overlapping filters (Sections 6 and 7).
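
The first two steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code. It assumes that x is standard Gaussian, x ~ N(0, I), and that the closed form for g is the standard degree-1 arc-cosine expression, g(u,v) = (1/(2 pi)) ||u|| ||v|| (sin theta + (pi - theta) cos theta), with theta the angle between u and v; that is how the review's reference to Lemma 3.1 is read here. All function names are hypothetical.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def angle(u, v):
    """Angle in [0, pi] between two nonzero vectors."""
    c = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error


def no_overlap_forward(x, w):
    """f(x; w) = (1/k) * sum_i relu(w . x[i]) over k disjoint patches of size len(w)."""
    d = len(w)
    k = len(x) // d
    patches = [x[i * d:(i + 1) * d] for i in range(k)]
    return sum(relu(dot(w, p)) for p in patches) / k


def g_closed_form(u, v):
    """Assumed closed form of E[relu(u.x) relu(v.x)] for x ~ N(0, I)."""
    t = angle(u, v)
    scale = norm(u) * norm(v) / (2.0 * math.pi)
    return scale * (math.sin(t) + (math.pi - t) * math.cos(t))


def g_monte_carlo(u, v, n_samples=200_000, seed=0):
    """Sample-based estimate of the same expectation, used as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in u]
        total += relu(dot(u, x)) * relu(dot(v, x))
    return total / n_samples
```

Comparing `g_closed_form(u, v)` with `g_monte_carlo(u, v)` for a few vectors is a quick way to check the assumed formula; for u = v it reduces to ||u||^2 / 2, as expected for a ReLU of a zero-mean Gaussian.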

Experimental results

Research questions

  • RQ1: Is learning No-Overlap Convolutional Networks with ReLU activations NP-hard under general input distributions?
  • RQ2: Under Gaussian input distributions, can gradient descent converge to the global optimum for No-Overlap Networks, and with what complexity?
  • RQ3: How does the inclusion of overlapping filters affect the presence of global optima and the behavior of gradient descent?
  • RQ4: Do empirical results align with theoretical tractability under Gaussian inputs and hardness in the general case?

Key findings

  • Learning No-Overlap Networks is NP-complete under unrestricted input distributions (reduction from Set-Splitting-by-k-Sets).
  • For Gaussian input distributions, gradient descent converges with high probability to the global optimum of the population risk in polynomial time, within O(1/epsilon^2) iterations.
  • The population loss for No-Overlap Networks has three critical points: a local maximum at w = 0, a unique global minimum at w = w*, and a degenerate saddle; these properties support convergence guarantees.
  • Networks with overlapping filters can have non-global local minima, and the suboptimal regions around them are non-trivial; empirically, random restarts can help recover the global minimum.
  • Empirical experiments show gradient-based optimization succeeds for Gaussian data but can get stuck for non-Gaussian data, illustrating the tractability gap.


This review was created by AI and reviewed by human editors.