QUICK REVIEW

[Paper Review] Trainability and Accuracy of Neural Networks: An Interacting Particle System Approach

Grant M. Rotskoff, Eric Vanden‐Eijnden|arXiv (Cornell University)|May 2, 2018

Markov Chains and Monte Carlo Methods|36 references|96 citations

TL;DR

This paper reinterprets neural network training as an interacting particle system and shows that, when the number of units n is large, the empirical distribution of parameters converges to the global minimum at a rate independent of n, with an approximation error that scales as O(n^{-1}). It also quantifies the noise introduced by SGD and gives guidelines for choosing the step size and batch size.

ABSTRACT

Neural networks, a central tool in machine learning, have demonstrated remarkable, high fidelity performance on image recognition and classification tasks. These successes evince an ability to accurately represent high dimensional functions, but rigorous results about the approximation error of neural networks after training are few. Here we establish conditions for global convergence of the standard optimization algorithm used in machine learning applications, stochastic gradient descent (SGD), and quantify the scaling of its error with the size of the network. This is done by reinterpreting SGD as the evolution of a particle system with interactions governed by a potential related to the objective or "loss" function used to train the network. We show that, when the number $n$ of units is large, the empirical distribution of the particles descends on a convex landscape towards the global minimum at a rate independent of $n$, with a resulting approximation error that universally scales as $O(n^{-1})$. These properties are established in the form of a Law of Large Numbers and a Central Limit Theorem for the empirical distribution. Our analysis also quantifies the scale and nature of the noise introduced by SGD and provides guidelines for the step size and batch size to use when training a neural network. We illustrate our findings on examples in which we train neural networks to learn the energy function of the continuous 3-spin model on the sphere. The approximation error scales as our analysis predicts in as high a dimension as $d=25$.

Motivation & Objective

Motivate the need for rigorous understanding of neural network approximation error after training.
Introduce an interacting particle system framework to analyze GD/SGD dynamics in wide neural networks.
Show that the empirical distribution of network parameters converges to a global minimum and quantify the scaling of approximation error.
Derive LLN and CLT results for the empirical distribution to characterize fluctuations at finite width.
Provide practical guidelines for step size and batch size in SGD based on the noise structure of the training process.
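
One way to make the particle picture concrete is to expand a quadratic loss for a width-n network with mean-field normalization. This is a sketch in notation assumed here for illustration: the symbols c_i, z_i, \varphi, \nu, C_f, F and K are choices made for this example and may differ from the paper's.

```latex
% Assumed setup: target f, unit \varphi(x, z), data measure \nu
f^{(n)}(x) = \frac{1}{n} \sum_{i=1}^{n} c_i \, \varphi(x, z_i)

\ell\bigl(f, f^{(n)}\bigr)
  = \frac{1}{2} \int \bigl| f(x) - f^{(n)}(x) \bigr|^2 \, d\nu(x)
  = C_f - \frac{1}{n} \sum_{i=1}^{n} c_i F(z_i)
    + \frac{1}{2 n^2} \sum_{i,j=1}^{n} c_i c_j K(z_i, z_j)

C_f = \frac{1}{2} \int |f(x)|^2 \, d\nu(x), \quad
F(z) = \int f(x) \, \varphi(x, z) \, d\nu(x), \quad
K(z, z') = \int \varphi(x, z) \, \varphi(x, z') \, d\nu(x)
```

Read this way, each unit (c_i, z_i) behaves like a particle in a single-particle potential set by F, with pairwise interactions set by K, and gradient descent on the parameters moves the particles down this energy.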

Proposed method

Represent network parameters as particles with a loss-derived interaction potential.
Derive an evolution equation for the empirical distribution of parameters and show it descends in a convex landscape in the 2-Wasserstein metric.
Establish a Law of Large Numbers: f_t^{(n)} converges to f_t solving a nonlinear Liouville/McKean–Vlasov type equation.
Prove a Central Limit Theorem showing that fluctuations of f_t^{(n)} around f_t are of order O(n^{-1/2}), and discuss how they heal to O(n^{-1}) at long times.
Extend analysis to stochastic gradient descent and online SGD, deriving scaling relations for batch size P relative to network width n.
Illustrate results on a high-dimensional spherical 3-spin model with Gaussian kernels and single-hidden-layer networks.
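
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tanh unit, the toy target sin(2 x_1), the inputs on the unit sphere, and all function names are hypothetical choices. Each hidden unit (c_i, z_i) is treated as one particle, the network output is the particle average, and the learning rate is multiplied by n so that every particle moves at an O(1) speed.

```python
import numpy as np


def make_data(rng, num_samples, dim):
    """Toy data (assumption): inputs on the unit sphere, target sin(2 * x_1)."""
    x = rng.normal(size=(num_samples, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y = np.sin(2.0 * x[:, 0])
    return x, y


def init_particles(rng, n, dim):
    """Draw n particles (c_i, z_i) independently from a standard normal."""
    return rng.normal(size=n), rng.normal(size=(n, dim))


def predict(c, z, x):
    """Mean-field network: f_n(x) = (1/n) * sum_i c_i * tanh(z_i . x)."""
    return np.tanh(x @ z.T) @ c / c.shape[0]


def loss(c, z, x, y):
    """Half the mean squared error over the samples."""
    r = predict(c, z, x) - y
    return 0.5 * np.mean(r ** 2)


def gd_step(c, z, x, y, lr):
    """One full-batch gradient step on all particles."""
    n = c.shape[0]
    num_samples = x.shape[0]
    h = np.tanh(x @ z.T)                      # (samples, n) unit activations
    r = h @ c / n - y                         # (samples,) residuals
    grad_c = h.T @ r / (num_samples * n)      # d loss / d c_i
    grad_z = ((r[:, None] * (1.0 - h ** 2)).T @ x) * c[:, None] / (num_samples * n)
    # Scale the step by n (mean-field time scaling, an assumption of this sketch).
    return c - lr * n * grad_c, z - lr * n * grad_z


def sgd_step(c, z, x, y, lr, batch_size, rng):
    """Same update, but on a random minibatch of size batch_size."""
    idx = rng.choice(x.shape[0], size=batch_size, replace=False)
    return gd_step(c, z, x[idx], y[idx], lr)
```

A typical use is to call `init_particles` once and then iterate `gd_step` or `sgd_step`, tracking `loss` for several widths n to see how the final error depends on the number of particles.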

Experimental results

Research questions

RQ1: What is the convergence behavior of SGD/GD when the number of network units n is large, and how does the training error scale with n?
RQ2: Can the training dynamics be understood via the empirical distribution of parameters, leading to LLN and CLT results?
RQ3: How do gradient descent and SGD differ in their noise structure, and what are the practical implications for step size and batch size?
RQ4: Does the limiting distributional approach yield universal approximation properties and provide guidance for network design in high dimensions?
RQ5: What are the quantitative behaviors of training dynamics in concrete models (e.g., 3-spin on the sphere) and do they agree with the theoretical predictions?

Key findings

The empirical distribution of network parameters converges to a global minimum on a time scale independent of n.
The approximation error scales universally as O(n^{-1}) as n → ∞ in any dimension d.
Fluctuations around the LLN limit are of order O(n^{-1/2}) for finite n and can heal to O(n^{-1}) over long times.
In online SGD with batch size P = O(n^{2α}) for α > 0, the LLN and some CLT results still hold. For α ∈ (0,1) the accuracy degrades to O(n^{-α}), while for α ≥ 1 the original O(n^{-1}) rate is recovered.
The framework yields practical guidelines for step size and batch size in SGD to achieve optimal error.
Numerical illustrations with a 3-spin model up to dimension d=25 show the predicted error scaling for both radial basis and single-hidden-layer networks.
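
A scaling claim such as O(n^{-1}) is commonly checked by fitting a straight line to log(error) against log(n) and reading off the slope. The sketch below shows that check. The function name is hypothetical and the numbers are synthetic placeholders constructed to decay exactly as 1/n; they are not results from the paper.

```python
import numpy as np


def fit_scaling_exponent(widths, errors):
    """Least-squares fit of log(error) = exponent * log(width) + log(prefactor)."""
    exponent, log_prefactor = np.polyfit(np.log(widths), np.log(errors), deg=1)
    return exponent, np.exp(log_prefactor)


# Synthetic placeholder values (NOT measurements): error = 3 / n exactly.
widths = np.array([64, 128, 256, 512, 1024], dtype=float)
errors = 3.0 / widths

exponent, prefactor = fit_scaling_exponent(widths, errors)
print(f"fitted exponent: {exponent:.3f}, prefactor: {prefactor:.3f}")
```

With measured errors in place of the placeholders, a fitted exponent near -1 would be consistent with the predicted rate, while an exponent near -α would match the degraded small-batch regime described above.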

This review was created by AI and reviewed by human editors.