QUICK REVIEW

[Paper Review] The generalization error of random features regression: Precise asymptotics and double descent curve

Song Mei, Andrea Montanari | arXiv (Cornell University) | 2019-08-14

Random Matrices and Applications | References: 31 | Citations: 248

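One-line summary

The paper derives a precise high-dimensional analysis of random features ridge regression, exhibits the double-descent generalization curve, and shows that the optimal test error is attained in the highly overparameterized regime, with or without regularization.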

ABSTRACT

Deep learning methods operate in regimes that defy the traditional statistical mindset. Neural network architectures often contain more parameters than training samples, and are so rich that they can interpolate the observed labels, even if the latter are replaced by pure noise. Despite their huge complexity, the same architectures achieve small generalization error on real data. This phenomenon has been rationalized in terms of a so-called `double descent' curve. As the model complexity increases, the test error follows the usual U-shaped curve at the beginning, first decreasing and then peaking around the interpolation threshold (when the model achieves vanishing training error). However, it descends again as model complexity exceeds this threshold. The global minimum of the test error is found above the interpolation threshold, often in the extreme overparametrization regime in which the number of parameters is much larger than the number of samples. Far from being a peculiar property of deep neural networks, elements of this behavior have been demonstrated in much simpler settings, including linear regression with random covariates. In this paper we consider the problem of learning an unknown function over the $d$-dimensional sphere $\mathbb S^{d-1}$, from $n$ i.i.d. samples $(\boldsymbol x_i, y_i)\in \mathbb S^{d-1} \times \mathbb R$, $i\le n$. We perform ridge regression on $N$ random features of the form $\sigma(\boldsymbol w_a^{\mathsf T} \boldsymbol x)$, $a\le N$. This can be equivalently described as a two-layers neural network with random first-layer weights. We compute the precise asymptotics of the test error, in the limit $N,n,d \to \infty$ with $N/d$ and $n/d$ fixed. This provides the first analytically tractable model that captures all the features of the double descent phenomenon without assuming ad hoc misspecification structures.

Research motivation and objectives

Motivate and analyze the double descent phenomenon in a nontrivial nonparametric setting of random features regression.
Compute exact asymptotics of test error in a proportional regime where N/d and n/d are fixed.
Characterize how regularization and signal-to-noise ratio affect generalization and the location of the interpolation threshold.
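
For orientation, the setup behind these objectives can be written out in the abstract's notation. This is a sketch, not the paper's exact statement: the symbol $f_*$ for the unknown target, the symbol $R(\lambda)$ for the test error, and the plain $\lambda\|\boldsymbol a\|_2^2$ normalization of the penalty are assumptions made here for illustration, and the paper's own scaling conventions may differ by dimension-dependent factors.

$$
\begin{aligned}
f(\boldsymbol x; \boldsymbol a, \boldsymbol W) &= \sum_{i=1}^{N} a_i\, \sigma(\boldsymbol w_i^{\mathsf T} \boldsymbol x), \\
\hat{\boldsymbol a}(\lambda) &= \arg\min_{\boldsymbol a \in \mathbb R^{N}} \left\{ \frac{1}{n} \sum_{j=1}^{n} \bigl( y_j - f(\boldsymbol x_j; \boldsymbol a, \boldsymbol W) \bigr)^2 + \lambda \|\boldsymbol a\|_2^2 \right\}, \\
R(\lambda) &= \mathbb E_{\boldsymbol x} \Bigl[ \bigl( f_*(\boldsymbol x) - f(\boldsymbol x; \hat{\boldsymbol a}(\lambda), \boldsymbol W) \bigr)^2 \Bigr], \\
N, n, d &\to \infty \quad \text{with} \quad N/d \to \psi_1, \; n/d \to \psi_2 .
\end{aligned}
$$

Here the first-layer weights $\boldsymbol w_i$ are drawn at random and held fixed, so only the second-layer coefficients $\boldsymbol a$ are fitted.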

Proposed method

Model the learning problem as ridge regression on N random features with activation sigma, trained on n samples from the d-dimensional sphere.
Derive precise asymptotics of test error R_RF in the limit N,n,d -> infinity with N/d -> psi1 and n/d -> psi2.
Express the prediction error as a function of psi1, psi2, lambda, and data statistics via Stieltjes transforms of a block-structured random matrix.
Show equivalence between random features and a Gaussian covariates model in the asymptotic limit to gain intuition.
Provide special-case simplifications including ridgeless limit and highly overparametrized regimes.
Relate the results to a kernel perspective and discuss self-induced regularization mechanisms.

Experimental results

Research questions

RQ1: What is the precise asymptotic prediction error for random features ridge regression in the proportional high-dimensional limit?
RQ2: How do the overparameterization ratio N/d, the sample ratio n/d, and regularization (lambda) interact to produce double descent in this nonparametric setting?
RQ3: Under what conditions does the random features model exhibit optimal generalization in highly overparametrized regimes?
RQ4: Can a Gaussian covariates proxy reproduce the same asymptotic generalization behavior as random features?
RQ5: How do linear versus nonlinear target functions affect the asymptotics of the test error?

Key findings

The paper obtains exact asymptotics for the test error in a proportional regime, capturing all features of the double descent phenomenon.
Above a critical signal-to-noise ratio, minimum test error is achieved by extremely overparametrized interpolators with vanishing training error.
Regularization can either help or hurt depending on the SNR, with an identified phase transition at a critical SNR where optimal lambda shifts.
The ridgeless limit (lambda -> 0) often yields near-interpolators that are statistically optimal in the highly overparametrized regime.
The analysis shows both variance and bias can peak at the interpolation threshold, and that the double descent persists even in noiseless settings.
The model demonstrates that optimal generalization can occur without specific misspecification assumptions and that strong overparameterization is beneficial under suitable conditions.
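
The peak at the interpolation threshold can be seen qualitatively in a small simulation. The sketch below is a finite-size experiment, not the paper's asymptotic formula: it sweeps the number of features N across N = n using the minimum-norm least-squares fit as a stand-in for the ridgeless limit. The function names, the ReLU activation, the linear target, the noise level, and the problem sizes are all assumptions chosen for illustration.

```python
import numpy as np


def unit_rows(rng, m, d):
    """Draw m points uniformly from the unit sphere S^{d-1}."""
    g = rng.standard_normal((m, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def ridgeless_test_error(rng, n, d, N, noise_std=0.5, n_test=500):
    """Test error of the min-norm least-squares fit on N ReLU random features."""
    X, X_te = unit_rows(rng, n, d), unit_rows(rng, n_test, d)
    W = unit_rows(rng, N, d)              # random first-layer weights, kept fixed
    beta = unit_rows(rng, 1, d)[0]

    def target(A):
        # Linear target with roughly unit variance (an assumption of this sketch).
        return np.sqrt(d) * A @ beta

    def feats(A):
        # sqrt(d) keeps pre-activations O(1) for unit-norm inputs and weights.
        return np.maximum(np.sqrt(d) * A @ W.T, 0.0)

    y = target(X) + noise_std * rng.standard_normal(n)
    # lstsq returns the least-squares solution for N < n and the
    # minimum-norm interpolator for N > n.
    a, *_ = np.linalg.lstsq(feats(X), y, rcond=None)
    return float(np.mean((target(X_te) - feats(X_te) @ a) ** 2))


def sweep(n=100, d=20, ratios=(0.2, 0.5, 1.0, 2.0, 5.0), trials=5, seed=0):
    """Average test error for each ratio N/n, over independent trials."""
    rng = np.random.default_rng(seed)
    return {
        r: float(np.mean([ridgeless_test_error(rng, n, d, int(r * n))
                          for _ in range(trials)]))
        for r in ratios
    }
```

Calling `sweep()` returns a dictionary mapping each ratio N/n to an averaged test error. With noisy labels, the entry at N/n = 1 is expected to be much larger than the entries well below or well above the threshold, which is the finite-size counterpart of the double descent curve. The exact values depend on the seed and on the sizes chosen.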

This review was generated by AI and reviewed by a human editor.