QUICK REVIEW

[Paper Review] How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

Zixiang Chen, Yuan Cao|arXiv (Cornell University)|Nov 27, 2019
Stochastic Gradient Optimization Techniques|65 references|29 citations
TL;DR

This paper establishes that polylogarithmic over-parameterization is sufficient for training deep ReLU networks with gradient descent: a network width that grows only polylogarithmically in the sample size $ n $ and the inverse error $ \epsilon^{-1} $ yields both global convergence and generalization guarantees. The authors introduce a relaxed linear approximation error condition on the NTRF (Neural Tangent Random Feature) function class, which gives tighter convergence and generalization bounds that match state-of-the-art results for two-layer networks.
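To give a feel for the gap between a polylogarithmic and a polynomial width requirement, the sketch below compares how the two grow with $ n $. The exponents (`power`, `n_power`, `eps_power`) are hypothetical placeholders chosen for illustration; they are not the exponents from the paper or from prior work.

```python
import math

def polylog_width(n, eps, power=3):
    """Hypothetical polylogarithmic width: (log n + log(1/eps)) ** power."""
    return (math.log(n) + math.log(1.0 / eps)) ** power

def polynomial_width(n, eps, n_power=4, eps_power=2):
    """Hypothetical polynomial width: n ** n_power * eps ** (-eps_power)."""
    return n ** n_power * eps ** (-eps_power)

# Compare the two requirements as the sample size grows (eps fixed at 0.01).
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(polylog_width(n, 0.01)), f"{polynomial_width(n, 0.01):.1e}")
```

Under these assumed exponents, a thousandfold increase in $ n $ changes the polylogarithmic width by a small constant factor, while the polynomial width grows by many orders of magnitude.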

ABSTRACT

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high degree polynomial of the training sample size $n$ and the inverse of the target error $ε^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy nice optimization and generalization guarantees. Very recently, it is shown that under certain margin assumptions on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, whether deep neural networks can be learned with such a mild over-parameterization is still an open question. In this work, we answer this question affirmatively and establish sharper learning guarantees for deep ReLU networks trained by (stochastic) gradient descent. In specific, under certain assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $ε^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.

Motivation & Objective

  • To resolve the open question of whether deep ReLU networks can be trained with polylogarithmic over-parameterization, akin to recent results for two-layer networks.
  • To improve generalization and optimization guarantees for deep networks under milder over-parameterization conditions than previous works.
  • To extend the NTRF function class framework to deep networks by allowing a constant linear approximation error, rather than requiring near-perfect approximation.
  • To establish tighter sample complexity bounds for GD and SGD in the deep network setting, matching the best-known results for two-layer networks.
  • To generalize theoretical results to scenarios with partial data separability, showing that well-separated data fractions enable efficient learning with minimal over-parameterization.

Proposed method

  • Propose a novel theoretical framework based on the NTRF (Neural Tangent Random Feature) function class, which characterizes functions as linear combinations of random features derived from the network's initial weights.
  • Introduce a relaxed condition allowing a constant linear approximation error between the true network and its linearization at initialization, rather than requiring high-accuracy approximation.
  • Analyze gradient descent (GD) and stochastic gradient descent (SGD) under this relaxed condition, proving global convergence to zero training error for sufficiently wide networks.
  • Derive generalization bounds using Rademacher complexity, showing that the statistical error diminishes with increasing width $ m $, even when $ m \in \widetilde{\Omega}(1) $ rather than $ m \gg n $.
  • Establish sample complexity bounds of $ \widetilde{\mathcal{O}}(\epsilon^{-2}) $ for GD and $ \widetilde{\mathcal{O}}(\epsilon^{-1}) $ for SGD, which are tighter than prior deep network results and match two-layer state-of-the-art bounds.
  • Extend analysis to data with partial separability, showing that when a large fraction of data are well-separated, the NTRF function class with radius $ R = \widetilde{\mathcal{O}}(1) $ can achieve $ \epsilon $-error generalization.
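The linear approximation error mentioned in the bullets above can be illustrated with a minimal sketch. It assumes a two-layer ReLU network (the paper treats deep networks) with a common NTK-style scaling, $ f_W(x) = m^{-1/2} \sum_r a_r \, \mathrm{relu}(\langle w_r, x \rangle) $, which may differ from the paper's exact parameterization. All function names (`init_network`, `linearized`, `perturb`, `approximation_gap`) are hypothetical. The code compares the network with its first-order expansion around the initialization $ W_0 $ for weights at Frobenius distance `radius` from $ W_0 $.

```python
import math
import random

def relu(z):
    return z if z > 0.0 else 0.0

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def init_network(m, d, rng):
    """Random initialization: Gaussian hidden weights, fixed +/-1 output signs."""
    W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    a = [1.0 if rng.random() < 0.5 else -1.0 for _ in range(m)]
    return W0, a

def network(W, a, x):
    """Two-layer ReLU network f_W(x) = m^{-1/2} * sum_r a_r * relu(<w_r, x>)."""
    m = len(W)
    return sum(a_r * relu(dot(w_r, x)) for w_r, a_r in zip(W, a)) / math.sqrt(m)

def linearized(W, W0, a, x):
    """First-order expansion f_{W0}(x) + <grad f_{W0}(x), W - W0>."""
    m = len(W0)
    total = network(W0, a, x) * math.sqrt(m)  # undo the scaling; reapplied below
    for w_r, w0_r, a_r in zip(W, W0, a):
        if dot(w0_r, x) > 0.0:  # gradient is a_r * x on active neurons, else 0
            total += a_r * dot([wi - w0i for wi, w0i in zip(w_r, w0_r)], x)
    return total / math.sqrt(m)

def perturb(W0, radius, rng):
    """Return W at Frobenius distance `radius` from W0, in a random direction."""
    delta = [[rng.gauss(0.0, 1.0) for _ in row] for row in W0]
    norm = math.sqrt(sum(v * v for row in delta for v in row))
    return [[w + radius * v / norm for w, v in zip(row, drow)]
            for row, drow in zip(W0, delta)]

def approximation_gap(m, d=5, radius=1.0, seed=0):
    """|f_W(x) - linearization(x)| for one unit-norm input and one random W."""
    rng = random.Random(seed)
    W0, a = init_network(m, d, rng)
    x = [1.0 / math.sqrt(d)] * d  # unit-norm input
    W = perturb(W0, radius, rng)
    return abs(network(W, a, x) - linearized(W, W0, a, x))

for m in (16, 256, 4096):
    print(m, approximation_gap(m))
```

In this simplified setting the gap comes only from neurons whose activation pattern flips between $ W_0 $ and $ W $, so it is always at most `radius` times the input norm and would typically be expected to shrink as $ m $ grows. The relaxed condition described above only asks for this kind of error to be bounded by a constant, not driven close to zero.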

Experimental results

Research questions

  • RQ1: Can deep ReLU networks be trained with polylogarithmic over-parameterization, similar to recent results for two-layer ReLU networks?
  • RQ2: Does allowing a constant linear approximation error (rather than high-accuracy approximation) still enable global convergence and generalization in deep networks?
  • RQ3: Can tighter generalization and convergence bounds be derived for GD and SGD in deep ReLU networks under milder width requirements?
  • RQ4: How does the theoretical framework extend to data with partial separability, and what width is required to achieve $ \epsilon $-generalization?
  • RQ5: Do the derived sample complexity bounds for GD and SGD in deep networks match or improve upon existing bounds, particularly in the two-layer case?

Key findings

  • A network width of $ m = \text{poly}(R) $, where $ R $ is the radius of the NTRF function class, is sufficient for GD to converge globally and learn deep ReLU networks; this width is polylogarithmic in $ n $ and $ \epsilon^{-1} $ whenever $ R $ is.
  • The generalization error diminishes for a wide range of widths $ m \in \widetilde{\Omega}(1) $, relaxing the typical requirement that $ m \gg n $ in prior NTK-based analyses.
  • The sample complexity for GD is $ \widetilde{\mathcal{O}}(\epsilon^{-2}) $, and for SGD is $ \widetilde{\mathcal{O}}(\epsilon^{-1}) $, which are tighter than previous bounds and match the best-known results for two-layer ReLU networks.
  • Theoretical guarantees hold even with a constant linear approximation error between the network and its linearization, enabling a significant relaxation of assumptions compared to prior works.
  • When a large fraction of training data are well-separated, the NTRF function class with radius $ R = \widetilde{\mathcal{O}}(1) $ can achieve $ \epsilon $-generalization, demonstrating robustness to data structure.
  • Empirical validation on binary CIFAR-10 subsets shows that the minimum network width required for zero training error grows polylogarithmically with sample size, consistent with theoretical predictions.
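As a rough arithmetic illustration of the two sample complexity rates listed above, the sketch below evaluates $ \epsilon^{-2} $ and $ \epsilon^{-1} $ for a few target errors. It drops all constants and logarithmic factors hidden in the $ \widetilde{\mathcal{O}}(\cdot) $ notation, so the numbers indicate scaling only; `samples_needed` is a hypothetical helper, not something defined in the paper.

```python
def samples_needed(inv_eps, rate_power):
    """Sample count n = (1/eps) ** rate_power, dropping constants and log factors."""
    return inv_eps ** rate_power

for inv_eps in (10, 100, 1000):
    gd_n = samples_needed(inv_eps, 2)   # GD rate quoted above: eps^-2
    sgd_n = samples_needed(inv_eps, 1)  # SGD rate quoted above: eps^-1
    print(f"eps=1/{inv_eps}: GD ~ {gd_n} samples, SGD ~ {sgd_n} samples")
```

At $ \epsilon = 0.01 $, for example, the two rates differ by a factor of $ \epsilon^{-1} = 100 $ before constants and logarithmic factors are taken into account.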


This review was created by AI and reviewed by human editors.