QUICK REVIEW

[Paper Review] Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee|arXiv (Cornell University)|Oct 12, 2018

Stochastic Gradient Optimization Techniques|78 references|46 citations

TL;DR

The paper shows that explicit L2 regularization can change sample efficiency: on a constructed d-dimensional distribution, the optimal regularized neural net learns with O(d) samples, while the NTK requires Omega(d^2) samples. It also proves that, for infinite-width two-layer nets, noisy gradient descent reaches a global minimum of the regularized loss in polynomially many iterations.

ABSTRACT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.

Motivation & Objective

Explain why explicit regularization and over-parameterization affect generalization in ways the NTK analysis does not capture.
Demonstrate a concrete data distribution that regularized nets learn with O(d) samples while the NTK needs Omega(d^2) samples.
Develop theoretical tools linking weak regularization to max-margin solutions and prove margin-based generalization bounds.
Show that infinite-width regularized nets can be optimized in polynomial time via perturbed Wasserstein gradient flow.
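
The link between weak regularization and max-margin solutions named in the objectives above can be written out as follows. This is a sketch in assumed notation: the symbols f, \theta, \lambda and \gamma, the binary logistic form of the cross-entropy loss, and the squared L2 penalty are chosen here for illustration and are not quoted from the paper. It also assumes f is positively homogeneous in \theta (as bias-free ReLU nets are) and that the data can be separated with positive margin.

```latex
\[
L_\lambda(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log\!\bigl(1 + e^{-y_i f(x_i;\theta)}\bigr) + \lambda \lVert \theta \rVert_2^2,
\qquad
\gamma(\theta) = \min_{1 \le i \le n} \; y_i \, f\!\Bigl(x_i; \tfrac{\theta}{\lVert \theta \rVert_2}\Bigr),
\]
\[
\gamma(\theta_\lambda) \;\longrightarrow\; \gamma^\star := \max_{\theta} \gamma(\theta)
\quad \text{as } \lambda \to 0, \qquad \theta_\lambda \in \arg\min_\theta L_\lambda(\theta).
\]
```

Read this way, the regularizer matters even when its weight is tiny: it selects, among all parameter settings that fit the data, the direction with the largest normalized margin.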

Proposed method

Construct a distribution D in d dimensions where signal is concentrated in the first two coordinates.
Analyze a two-layer ReLU network trained with L2-regularized logistic loss versus the NTK induced by the architecture.
Prove that the regularized NN converges to a max-margin solution (under weak regularization) and generalizes well.
Introduce a perturbed Wasserstein gradient flow and prove polynomial-time convergence to a global minimum for infinite-width networks.
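
The quantities in the method above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names, the bias-free two-layer parameterization (W, a) and the squared L2 penalty are all assumptions made here. Because the bias-free network is 2-homogeneous in its parameters, rescaling to unit norm amounts to dividing the output by the squared parameter norm.

```python
import numpy as np

def two_layer_relu(X, W, a):
    """f(x) = sum_j a_j * relu(w_j . x); no biases, so f is 2-homogeneous in (W, a)."""
    return np.maximum(X @ W.T, 0.0) @ a

def regularized_logistic_loss(X, y, W, a, lam):
    """Mean logistic loss plus lam times the squared L2 norm of all parameters."""
    margins = y * two_layer_relu(X, W, a)
    data_term = np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + exp(-m))
    penalty = lam * (np.sum(W ** 2) + np.sum(a ** 2))
    return data_term + penalty

def normalized_margin(X, y, W, a):
    """Smallest y_i * f(x_i) after rescaling the parameters to unit L2 norm."""
    norm_sq = np.sum(W ** 2) + np.sum(a ** 2)
    # 2-homogeneity: f(x; theta / ||theta||) = f(x; theta) / ||theta||^2
    return np.min(y * two_layer_relu(X, W, a)) / norm_sq
```

Here X is an (n, d) array of inputs, y holds labels in {-1, +1}, W has shape (width, d) and a has shape (width,). The normalized margin is unchanged when W and a are scaled by the same positive constant, which is why it is the natural quantity to compare across solutions of different norm.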

Experimental results

Research questions

RQ1: Can explicit L2 regularization enable neural nets to achieve better margins and generalization than the NTK kernel?
RQ2: What is the sample complexity gap between regularized neural nets and NTK-based methods on a constructed data distribution?
RQ3: Is the regularized global optimum attainable via efficient optimization in the infinite-width limit?
RQ4: Does weak regularization push the optimizer toward max-margin solutions across deep architectures?

Key findings

Regularized neural nets achieve good generalization with O(d) samples on the constructed distribution, while the NTK requires Omega(d^2) samples.
The global optimizer of weakly-regularized logistic loss attains the max normalized margin among networks of the same architecture.
Over-parameterization in width helps: the maximum possible margin is non-decreasing in network width, which improves the generalization bounds.
For infinite-width two-layer nets, noisy gradient descent optimizes the regularized loss to a global minimum in polynomial time.
Empirical simulations corroborate improved margin and test accuracy with explicit regularization compared to unregularized nets.
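
To make "noisy gradient descent on the regularized loss" concrete, here is a minimal finite-width sketch. It is an illustrative stand-in and not the paper's algorithm: the result summarized above concerns infinite-width nets and a perturbed Wasserstein gradient flow, whereas this code simply adds Gaussian noise to each gradient step on a bias-free two-layer ReLU net. The function names, hyperparameter defaults and noise model are all assumptions.

```python
import numpy as np

def loss_and_grads(X, y, W, a, lam):
    """L2-regularized logistic loss of f(x) = a . relu(W x), with its gradients."""
    n = X.shape[0]
    pre = X @ W.T                      # (n, width) pre-activations
    H = np.maximum(pre, 0.0)           # ReLU features
    m = y * (H @ a)                    # per-example margins y_i * f(x_i)
    loss = np.mean(np.logaddexp(0.0, -m)) + lam * (np.sum(W ** 2) + np.sum(a ** 2))
    g = -y * np.exp(-np.logaddexp(0.0, m)) / n   # d(loss)/d f(x_i), i.e. -y_i * sigmoid(-m_i) / n
    grad_a = H.T @ g + 2.0 * lam * a
    grad_W = ((g[:, None] * (pre > 0)) * a[None, :]).T @ X + 2.0 * lam * W
    return loss, grad_W, grad_a

def noisy_gd(X, y, width=50, lam=1e-3, lr=0.1, noise_std=1e-3, steps=500, seed=0):
    """Gradient descent with Gaussian noise added to every update (assumed noise model)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)
    for _ in range(steps):
        _, grad_W, grad_a = loss_and_grads(X, y, W, a, lam)
        W = W - lr * grad_W + noise_std * rng.standard_normal(W.shape)
        a = a - lr * grad_a + noise_std * rng.standard_normal(a.shape)
    return W, a
```

Setting noise_std=0 recovers plain gradient descent on the same objective, and setting lam=0 gives the unregularized baseline that the simulations in the last finding compare against.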

This review was created by AI and reviewed by human editors.