QUICK REVIEW

[Paper Digest] Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed

Maria Refinetti, Sebastian Goldt | arXiv (Cornell University) | Feb 23, 2021

Neural Networks and Applications | References: 64 | Cited by: 29

One-sentence summary

The paper shows that two-layer neural networks with only a few hidden neurons can outperform kernel and random-feature learning on a high-dimensional Gaussian mixture classification task, and derives a closed set of ODEs that describes their training dynamics in the limit D→∞.

ABSTRACT

A recent series of theoretical works showed that the dynamics of neural networks with a certain initialisation are well-captured by kernel methods. Concurrent empirical work demonstrated that kernel methods can come close to the performance of neural networks on some image classification tasks. These results raise the question of whether neural networks only learn successfully if kernels also learn successfully, despite neural networks being more expressive. Here, we show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning on a simple Gaussian mixture classification task. We study the high-dimensional limit where the number of samples is linearly proportional to the input dimension, and show that while small 2LNN achieve near-optimal performance on this task, lazy training approaches such as random features and kernel methods do not. Our analysis is based on the derivation of a closed set of equations that track the learning dynamics of the 2LNN and thus allow to extract the asymptotic performance of the network as a function of signal-to-noise ratio and other hyperparameters. We finally illustrate how over-parametrising the neural network leads to faster convergence, but does not improve its final performance.

Motivation and objectives

Identify and quantify the conditions under which neural networks outperform kernel methods on high-dimensional Gaussian mixtures.
Develop a tractable dynamical systems (ODE) framework to capture online SGD training of two-layer neural networks in the high-dimensional limit.
Compare neural networks with random features/kernels in the same regime and identify scaling laws for performance.
Investigate the impact of over-parameterisation on convergence speed and final accuracy.

Proposed method

Derive a closed set of ODEs that track the evolution of order parameters (M, Q) and second-layer weights v during online SGD for a 2LNN with K hidden units.
Take the high-dimensional limit (D→∞) with N∝D, in which the learning dynamics reduce to these ODEs, enabling analytical characterisation of the population mean-squared error (pmse) and the classification error.
Analyse Gaussian mixtures in which the input distribution is conditioned on the label, and compute the asymptotic performance from the fixed points of the ODEs.
Model random features by projecting inputs to P features with a fixed random matrix and training a linear readout; derive RF performance in the high-dimensional limit using eigen-decompositions of the feature covariance.
Compare the kernel limit of random features (γ = P/D → ∞) with the 2LNN performance, and examine how performance scales with the signal-to-noise ratio and other hyperparameters.
Examine the effect of over-parameterisation on convergence probability and final error.
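
The setup in the list above can be illustrated with a minimal simulation sketch. It does not reproduce the paper's ODEs; it only runs one-pass (online) SGD on a small ReLU network and then measures overlaps of the kind the order parameters M and Q describe. The function names, the ±1 labels, the scaling of the cluster means, the ReLU activation, the learning rate, and the exact normalisation of M and Q are all assumptions made for illustration.

```python
import numpy as np

def sample_xor_mixture(n, d, snr, rng):
    """Draw n points from a four-cluster XOR-like Gaussian mixture.

    Assumption: the cluster means are +/-mu[0] and +/-mu[1], with mu[0] and
    mu[1] orthogonal and of squared norm snr; opposite clusters share a label.
    """
    mu = np.zeros((2, d))
    mu[0, 0] = np.sqrt(snr)
    mu[1, 1] = np.sqrt(snr)
    axis = rng.integers(0, 2, size=n)        # which mean direction
    sign = rng.choice([-1.0, 1.0], size=n)   # which side of the origin
    x = sign[:, None] * mu[axis] + rng.standard_normal((n, d))
    y = np.where(axis == 0, 1.0, -1.0)       # label depends only on the axis
    return x, y, mu

def forward(w, v, x):
    """Two-layer network with K hidden ReLU units applied to one input x."""
    pre = w @ x / np.sqrt(x.shape[0])        # pre-activations, shape (K,)
    return v @ np.maximum(pre, 0.0), pre

def sgd_step(w, v, x, y, lr):
    """One online SGD step on the squared loss 0.5 * (output - y)**2."""
    out, pre = forward(w, v, x)
    err = out - y
    grad_v = err * np.maximum(pre, 0.0)
    grad_w = err * np.outer(v * (pre > 0), x) / np.sqrt(x.shape[0])
    return w - lr * grad_w, v - lr * grad_v

def order_parameters(w, mu):
    """Overlaps of the first-layer weights with the means (M) and with
    each other (Q). The normalisation is an assumption of this sketch."""
    d = w.shape[1]
    m = w @ mu.T / np.sqrt(d)                # shape (K, 2)
    q = w @ w.T / d                          # shape (K, K)
    return m, q

rng = np.random.default_rng(0)
d, k = 200, 4
x, y, mu = sample_xor_mixture(1000, d, snr=9.0, rng=rng)
w = rng.standard_normal((k, d))
v = rng.standard_normal(k)
for x_t, y_t in zip(x, y):                   # each sample is used exactly once
    w, v = sgd_step(w, v, x_t, y_t, lr=0.01)
m, q = order_parameters(w, mu)
```

Tracking m, q and v over the course of such a run gives the empirical counterpart of the quantities whose evolution the closed ODE system predicts as D grows.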

Experimental results

Research questions

RQ1: Can a small two-layer neural network outperform kernel-based learning on high-dimensional Gaussian mixtures?
RQ2: What are the asymptotic learning dynamics of a 2LNN trained by online SGD in the D→∞, N∝D regime?
RQ3: How do random features and kernel methods perform relative to 2LNN on the same Gaussian mixture task in the high-dimensional limit?
RQ4: How does over-parameterisation affect convergence speed and final generalisation in this setting?

Key findings

A two-layer neural network with a few hidden neurons achieves near-oracle performance on an XOR-like Gaussian mixture, while kernel and random-feature methods require a much higher SNR to approach that performance.
The dynamics of the 2LNN can be captured by a closed set of ODEs in the high-dimensional limit, enabling analytic prediction of long-time performance.
Random features and kernel methods fail to beat random guessing at low signal-to-noise ratio in the high-dimensional regime N∝D; improving on this requires either a higher signal-to-noise ratio or a sample size that grows super-linearly with the dimension (N=O(D^2)).
Over-parameterisation increases the probability of converging to near-optimal solutions and accelerates learning, but does not improve the final error when converged.
For random features, the asymptotic error depends on the eigenstructure of the random-feature covariance; in the large-P limit (kernel limit), performance is recovered, but requires large γ and N.
The analysis shows that in high dimensions lazy training regimes (kernels and random features) act on the mixture as an effectively linear transformation, so a mixture that is not linearly separable remains non-separable when the cluster centres are close.
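
The last finding can be made concrete with a small sketch. For an XOR-like layout, both class-conditional means sit at the origin, so any linear score has the same expectation under either label; an oracle that knows the cluster means can instead compare the two projections. The code also shows the shape of a random-feature model: a fixed random projection followed by a trained linear readout. The layout, the ±1 labels, the ReLU feature map, the ridge readout, and all function names are assumptions of this sketch rather than the paper's exact construction.

```python
import numpy as np

def random_features(x, proj):
    """Fixed random projection followed by a ReLU; proj is never trained."""
    return np.maximum(x @ proj.T / np.sqrt(x.shape[-1]), 0.0)

def fit_ridge_readout(z, y, reg):
    """Train only the linear readout on the features, via ridge regression."""
    p = z.shape[1]
    return np.linalg.solve(z.T @ z + reg * np.eye(p), z.T @ y)

def oracle_predict(x, mu):
    """Oracle with access to the means: pick the mean direction with the
    larger absolute projection (assumes orthogonal, equal-norm means)."""
    return np.where(np.abs(x @ mu[0]) > np.abs(x @ mu[1]), 1.0, -1.0)

# XOR-like layout (assumption): +/-mu[0] carry label +1, +/-mu[1] label -1.
d, snr = 50, 4.0
mu = np.zeros((2, d))
mu[0, 0] = np.sqrt(snr)
mu[1, 1] = np.sqrt(snr)
centres = np.vstack([mu[0], -mu[0], mu[1], -mu[1]])
labels = np.array([1.0, 1.0, -1.0, -1.0])

# Both class means are the zero vector, so a linear score w.x + b cannot
# distinguish the classes on average.
mean_pos = centres[labels > 0].mean(axis=0)
mean_neg = centres[labels < 0].mean(axis=0)

# Random-feature model: P = 100 fixed features, ridge readout on N = 400 samples.
rng = np.random.default_rng(0)
idx = rng.integers(0, 4, size=400)
x = centres[idx] + rng.standard_normal((400, d))
y = labels[idx]
proj = rng.standard_normal((100, d))
z = random_features(x, proj)
readout = fit_ridge_readout(z, y, reg=1.0)
rf_train_error = np.mean(np.sign(z @ readout) != y)
oracle_error = np.mean(oracle_predict(x, mu) != y)
```

Comparing rf_train_error and oracle_error on fresh samples while varying d, the number of features and snr is one way to explore the gap described above; the numbers from a single small run like this should not be read as the paper's asymptotic results.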

This digest was generated by AI and reviewed by human editors.