QUICK REVIEW

[Paper Review] Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

Yuan Cao, Quanquan Gu|arXiv (Cornell University)|Feb 4, 2019
Machine Learning and ELM|References: 90|Cited by: 68
One-Sentence Summary

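This paper derives an algorithm-dependent generalization bound for gradient descent on over-parameterized deep ReLU networks, and proves that under certain assumptions on the data, a sufficiently wide network lets GD achieve arbitrarily small generalization error.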

ABSTRACT

Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization performance of over-parameterized DNNs. The major limitation of most existing generalization bounds is that they are based on uniform convergence and are independent of the training algorithm. In this work, we derive an algorithm-dependent generalization error bound for deep ReLU networks, and show that under certain assumptions on the data distribution, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error. Our work sheds light on explaining the good generalization performance of over-parameterized deep neural networks.

Research Motivation and Objectives

  • Explain why gradient descent can yield good generalization for over-parameterized deep ReLU networks.
  • Provide algorithm-dependent generalization bounds that improve over uniform convergence bounds.
  • Show convergence of gradient descent to near-initialization global minima under over-parameterization.
  • Analyze two data distribution assumptions under which GD attains epsilon-generalization with polynomially many samples.

Proposed Method

  • Study binary classification with L-hidden-layer fully connected ReLU networks trained by gradient descent on cross-entropy loss.
  • Initialize the weights with Gaussian He initialization and run GD to minimize the empirical risk.
  • Define a tau-neighborhood around initialization and use Rademacher complexity to bound generalization gap.
  • Introduce empirical and population surrogate errors to relate optimization and generalization performance.
  • Prove convergence of GD to a global minimum within the tau-neighborhood under a gradient lower bound condition (Theorem 4.7).
  • Provide two data-distribution assumptions (Separable by Random ReLU Feature and Separable by Conjugate Kernel) with corollaries giving epsilon-generalization bounds.
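
The training setup listed above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes a single hidden layer (L = 1), a fixed output vector v with entries ±1, a tiny hand-made dataset, and the Frobenius norm as a stand-in for the distance that defines the tau-neighborhood. All function names and hyperparameters here are hypothetical.

```python
import math
import random


def he_init(rows, cols, rng):
    """Gaussian weights with variance 2 / fan_in (He initialization)."""
    std = math.sqrt(2.0 / cols)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def forward(W, v, x):
    """Return the output v^T relu(W x) and the hidden activations."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(vi * h for vi, h in zip(v, hidden)), hidden


def logistic_loss(margin):
    """Numerically stable log(1 + exp(-margin)) for labels in {-1, +1}."""
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def empirical_loss(W, v, data):
    """Average cross-entropy loss over the training set."""
    return sum(logistic_loss(y * forward(W, v, x)[0]) for x, y in data) / len(data)


def gd_step(W, v, data, lr):
    """One full-batch gradient descent step on the hidden-layer weights."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, y in data:
        out, hidden = forward(W, v, x)
        # Derivative of log(1 + exp(-y * out)) w.r.t. out is -y * sigmoid(-y * out);
        # the tanh form of the sigmoid avoids overflow.
        coeff = -y * 0.5 * (1.0 - math.tanh(0.5 * y * out))
        for i, h in enumerate(hidden):
            if h > 0.0:  # ReLU passes gradient only through active units
                for j, xj in enumerate(x):
                    grad[i][j] += coeff * v[i] * xj / len(data)
    return [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W, grad)]


def distance_from_init(W, W0):
    """Frobenius norm of W - W0 (assumed proxy for the radius tau)."""
    return math.sqrt(sum((a - b) ** 2 for r, r0 in zip(W, W0) for a, b in zip(r, r0)))


def train(data, width=64, steps=300, lr=0.01, seed=0):
    """Run GD from a He-initialized network; return losses and distance moved."""
    rng = random.Random(seed)
    W0 = he_init(width, len(data[0][0]), rng)
    v = [1.0 if i % 2 == 0 else -1.0 for i in range(width)]  # fixed output layer (assumption)
    W = [row[:] for row in W0]
    for _ in range(steps):
        W = gd_step(W, v, data, lr)
    return empirical_loss(W0, v, data), empirical_loss(W, v, data), distance_from_init(W, W0)


# Toy linearly separable data on the unit circle, labels in {-1, +1} (illustrative only).
TOY_DATA = [((1.0, 0.0), 1), ((0.8, 0.6), 1), ((0.8, -0.6), 1),
            ((-1.0, 0.0), -1), ((-0.8, 0.6), -1), ((-0.8, -0.6), -1)]
```

Calling `train(TOY_DATA)` returns the empirical loss before and after training together with the distance travelled from initialization, which is the quantity the tau-neighborhood argument keeps small.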

Experimental Results

Research Questions

  • RQ1: Under what data distribution conditions can GD train over-parameterized deep ReLU networks to achieve small generalization error?
  • RQ2: How does the algorithm-dependent generalization bound scale with network width in the over-parameterized regime?
  • RQ3: Can gradient descent converge to a global minimum close to initialization for deep ReLU networks, and what are the required width and initialization conditions?
  • RQ4: What are the concrete implications of assuming separability by random ReLU features or by conjugate kernels for generalization guarantees?

Key Findings

  • An informal result shows that with per-layer width m_l = tilde Omega(epsilon^-14) and n = tilde Omega(epsilon^-4), GD with proper initialization achieves population error at most epsilon with high probability.
  • The generalization bound in Theorem 4.5 scales with tau and m such that the bound on the generalization gap is roughly tilde O(tau * sqrt(m/n)) under He initialization, improving width dependency over some prior bounds.
  • Gradient descent is shown to converge to a global minimum within a tau-neighborhood (Theorem 4.7) given a gradient lower bound condition that scales with sqrt(m).
  • Corollaries under specific data-distribution assumptions (Separable by Random ReLU Feature or Separable by Conjugate Kernel) yield epsilon-generalization with polynomially many samples: m* = tilde O(poly(2^L, gamma^-1)) * epsilon^-14 and n* = tilde O(poly(2^L, gamma^-1)) * epsilon^-4 in the first case; analogous bounds with gamma^-1 dependencies in the second.
  • The results provide an algorithm-dependent generalization bound for wide neural networks of arbitrary depth, linking optimization dynamics to generalization without the width-independence caveat of uniform convergence bounds.
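
To get a feel for the rates quoted above, the following sketch evaluates only their leading-order terms. It drops all constants, logarithmic factors, and the poly(2^L, gamma^-1) dependence, so the numbers are meaningful only as ratios. The function names are hypothetical.

```python
import math


def rate_factors(eps):
    """Leading-order width and sample-size factors for a target error eps."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    # Width scales as eps^-14 and sample size as eps^-4, up to omitted factors.
    return {"width_factor": eps ** -14, "sample_factor": eps ** -4}


def gap_scale(tau, m, n):
    """Leading-order term tau * sqrt(m / n) of the generalization-gap bound."""
    return tau * math.sqrt(m / n)
```

Under these simplifications, halving eps multiplies the width factor by 2^14 = 16384 but the sample factor by only 2^4 = 16, and for fixed tau and m the gap term shrinks as 1/sqrt(n).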


This review was generated by AI and reviewed by human editors.