QUICK REVIEW

[Paper Review] Optimal Regularization Can Mitigate Double Descent

Preetum Nakkiran, Prayaag Venkat | arXiv (Cornell University) | Mar 4, 2020

Sparse and Compressive Sensing Techniques | References: 44 | Cited by: 48

One-Sentence Summary

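The paper shows that, in certain linear settings, optimally tuned L2 regularization (ridge regression) yields test performance that improves monotonically as the sample size or model size grows, and it demonstrates empirically that this mitigation extends to broader model classes such as neural networks.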

ABSTRACT

Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised questions of if we need to re-think our current understanding of generalization. In this work, we study whether the double-descent phenomenon can be avoided by using optimal regularization. Theoretically, we prove that for certain linear regression models with isotropic data distribution, optimally-tuned $\ell_2$ regularization achieves monotonic test performance as we grow either the sample size or the model size. We also demonstrate empirically that optimally-tuned $\ell_2$ regularization can mitigate double descent for more general models, including neural networks. Our results suggest that it may also be informative to study the test risk scalings of various algorithms in the context of appropriately tuned regularization.

Research Motivation and Objectives

Motivate and define the double descent phenomenon across data and model growth.
Investigate whether optimal L2 regularization can produce monotone test risk in high-dimensional linear regression.
Extend the analysis to model-wise monotonicity under projections and random feature settings.
Provide empirical evidence showing monotonicity with optimal regularization in neural networks and CNNs.
Discuss limitations, counterexamples, and potential extensions to general covariance structures.

Proposed Method

Analyze ridge regression in a high-dimensional linear model with isotropic Gaussian covariates and well-specified linear truth.
Derive the optimal ridge parameter lambda_opt and show it is independent of sample size n in the isotropic setting (Lemma 2).
Prove sample-wise monotonicity: increasing n never increases the expected test risk when using the optimally tuned ridge (Theorem 1).
Show model-wise monotonicity for a setting with random projection to a fixed model size d, using optimally tuned ridge (Theorem 3).
Provide non-asymptotic arguments based on singular value interlacing and a partial evaluation of risk (Lemmas 1 and 2).
Extend experiments to non-isotropic covariates, random ReLU features, and CNNs to illustrate empirical monotonicity under optimal regularization.
Discuss counterexamples where monotonicity fails and propose adaptive regularization in non-isotropic settings (Section 6).
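
The sample-wise claim in the steps above can be checked numerically. The following is a minimal simulation sketch, not the authors' code. It assumes the unnormalized ridge objective ||Xβ − y||² + λ||β||², covariates x ~ N(0, I_d), and Gaussian noise with variance σ²; under those assumptions the n-independent optimum works out to λ = dσ²/||β*||² (the paper's normalization may differ). The helper names (`ridge_fit`, `expected_excess_risk`), the dimensions, and the tiny `lam_tiny` used as a stand-in for unregularized least squares are all illustrative choices.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimizer of ||X b - y||^2 + lam * ||b||^2 (unnormalized objective)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def expected_excess_risk(n, d, beta, sigma, lam, trials, rng):
    """Monte Carlo estimate of E||beta_hat - beta||^2.

    For x ~ N(0, I_d) this equals the expected test risk minus sigma^2.
    """
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))                  # isotropic Gaussian covariates
        y = X @ beta + sigma * rng.standard_normal(n)    # well-specified linear model
        total += np.sum((ridge_fit(X, y, lam) - beta) ** 2)
    return total / trials


rng = np.random.default_rng(0)
d, sigma, trials = 30, 0.5, 200
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                 # normalize so that ||beta|| = 1

lam_opt = d * sigma**2 / np.sum(beta**2)     # assumed closed form; does not depend on n
lam_tiny = 1e-6                              # stand-in for (almost) no regularization

sample_sizes = [10, 20, 30, 40, 60, 120]     # n = d = 30 is the interpolation threshold
risk_tiny = [expected_excess_risk(n, d, beta, sigma, lam_tiny, trials, rng)
             for n in sample_sizes]
risk_opt = [expected_excess_risk(n, d, beta, sigma, lam_opt, trials, rng)
            for n in sample_sizes]

for n, r_tiny, r_opt in zip(sample_sizes, risk_tiny, risk_opt):
    print(f"n={n:4d}  near-unregularized={r_tiny:10.3f}  optimal ridge={r_opt:7.3f}")
```

In this sketch the near-unregularized risk should spike around n = d, while the optimally tuned column should decrease steadily with n. The estimates are Monte Carlo averages, so small fluctuations are expected if the trial count is reduced.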

Experimental Results

Research Questions

RQ1: Can optimal L2 regularization remove or mitigate double descent in linear regression?
RQ2: Is test performance monotone with increasing data or model size when the regularization strength is optimally tuned?
RQ3: Does optimal regularization extend to model-wise double descent under projections to lower-dimensional subspaces?
RQ4: How do these monotonicity properties translate to more general covariate structures beyond isotropic Gaussian data?
RQ5: What are the empirical implications for using adaptive or data-dependent regularization in neural networks and CNNs?

Key Findings

Optimally-tuned ridge regression yields monotonic test performance with increasing samples in isotropic linear regression (sample-wise monotonicity).
The optimal ridge parameter lambda_opt is independent of n in the isotropic setting, and the expected risk can be expressed in a form that facilitates monotonicity arguments.
Under model-size growth with random projection to a d-dimensional subspace, optimally-tuned ridge regression achieves monotone test performance (model-wise monotonicity).
Empirically, optimal L2 regularization mitigates double descent in non-isotropic regression, random ReLU features, and convolutional neural networks.
Counterexamples exist where optimally-regularized ridge regression is not monotonic for certain non-Gaussian or heteroscedastic settings, motivating adaptive regularization approaches.
The work suggests studying test risk scalings of algorithms under appropriately tuned regularization as a path to understanding generalization.
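
For the isotropic linear setting, the risk expression mentioned in the findings above can be sketched as follows. The notation here is an assumption and may differ from the paper's: the ridge objective is taken as the unnormalized $\|X\beta - y\|_2^2 + \lambda\|\beta\|_2^2$, the rows of $X \in \mathbb{R}^{n \times d}$ are drawn from $N(0, I_d)$, the noise $\varepsilon$ has variance $\sigma^2$, and $\gamma_1, \dots, \gamma_d$ denote the singular values of $X$ (with $\gamma_i = 0$ for $i > n$). The quantity shown is the excess risk; the test risk adds $\sigma^2$.

```latex
\begin{align*}
\hat\beta_\lambda &= (X^\top X + \lambda I_d)^{-1} X^\top y,
  \qquad y = X\beta^* + \varepsilon, \\
\mathbb{E}\,\|\hat\beta_\lambda - \beta^*\|_2^2
  &= \mathbb{E}\sum_{i=1}^{d}
     \frac{\lambda^2 \|\beta^*\|_2^2 / d + \sigma^2 \gamma_i^2}{(\gamma_i^2 + \lambda)^2}, \\
\lambda_{\mathrm{opt}} &= \frac{d\,\sigma^2}{\|\beta^*\|_2^2}, \\
\mathbb{E}\,\|\hat\beta_{\lambda_{\mathrm{opt}}} - \beta^*\|_2^2
  &= \sigma^2\, \mathbb{E}\sum_{i=1}^{d} \frac{1}{\gamma_i^2 + \lambda_{\mathrm{opt}}}.
\end{align*}
```

The second line uses the rotational invariance of Gaussian $X$ to average the bias term over directions, and $\lambda_{\mathrm{opt}}$ minimizes every summand at once, which is why it does not depend on $n$. In the last line each summand is decreasing in $\gamma_i^2$, and appending a row to $X$ cannot decrease any eigenvalue of $X^\top X$, which is the intuition behind sample-wise monotonicity.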

This review was generated by AI and reviewed by human editors.