[Paper Summary] Understanding overfitting peaks in generalization error: Analytical risk curves for $l_2$ and $l_1$ penalized interpolation
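This paper introduces MiSpaR (Misparametrized Sparse Regression) to derive analytical training and generalization error curves for $l_2$- and $l_1$-penalized interpolation in the high-dimensional setting. It argues that the overfitting peak does not cleanly separate a classical regime from a modern one, and shows when each penalty generalizes well.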
Traditionally in regression one minimizes the number of fitting parameters or uses smoothing/regularization to trade off training error (TE) against generalization error (GE). Driving TE to zero by increasing the fitting degrees of freedom (dof) is expected to increase GE. However, modern big-data approaches, including deep nets, seem to over-parametrize and send TE to zero (data interpolation) without impacting GE. Overparametrization has the benefit that global minima of the empirical loss function proliferate and become easier to find. These phenomena have drawn theoretical attention. Regression and classification algorithms have been shown to interpolate data yet still generalize optimally. An interesting related phenomenon has been noted: the existence of non-monotonic risk curves, with a peak in GE with increasing dof. It was suggested that this peak separates a classical regime from a modern regime where over-parametrization improves performance. Similar over-fitting peaks were reported previously (statistical physics approach to learning) and attributed to increased fitting model flexibility. We introduce a generative and fitting model pair ("Misparametrized Sparse Regression" or MiSpaR) and show that the overfitting peak can be dissociated from the point at which the fitting function gains enough dof to match the data generative model and thus provides good generalization. This complicates the interpretation of overfitting peaks as separating a "classical" from a "modern" regime. Data interpolation itself cannot guarantee good generalization: we need to study the interpolation with different penalty terms. We present analytical formulae for GE curves for MiSpaR with $l_2$ and $l_1$ penalties, in the interpolating limit $\lambda \rightarrow 0$. These risk curves exhibit important differences and help elucidate the underlying phenomena.
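Motivation and Objectives
- Introduce the Misparametrized Sparse Regression (MiSpaR) framework, which decouples the number of measurements, the number of generative model parameters, and the fitting degrees of freedom.
- Derive analytical expressions for training and generalization error under $l_2$ and $l_1$ penalties in the interpolation limit.
- Show how the overfitting peak relates to the interpolation point, and how sparsity and noise affect generalization.
- Compare ridge regression ($l_2$) with the sparsity-inducing penalty ($l_1$) to show when regularization helps or hurts generalization in the overparametrized case.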
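Proposed Method
- Introduce MiSpaR, a setting in which the number of fitted parameters $p$ may differ from both the number of generative parameters $n$ and the number of measurements $m$.
- Derive high-dimensional asymptotics as $m,p,n\to\infty$ with the ratios $\mu=p/m$ and $\alpha=m/n$ held fixed, yielding analytical TE and GE for $l_2$ regression.
- Give an analytical GE expression for the $l_1$ penalty; in the interpolation limit it reduces to a pair of nonlinear equations that are solved numerically.
- Show how the effective noise changes under both penalties with under-/over-sampling ($\alpha$, $\mu$) and sparsity ($\rho$).
- Use self-averaging arguments and random matrix theory (the Marchenko-Pastur distribution) to evaluate the sums required by the GE/TE expressions.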
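The setup in the list above can be explored numerically. The following is a minimal simulation sketch, not the paper's analytical derivation. It assumes an i.i.d. standard Gaussian design, a generative vector with `round(rho * n)` nonzero Gaussian entries among the first $n$ features, a fitting model that uses only the first $p$ features, and a GE normalised by the signal power. The function names `mispar_l2` and `sweep`, the default sizes, and the normalisation are all illustrative assumptions that do not come from the source.

```python
import numpy as np


def mispar_l2(m, n, p, rho=0.2, sigma=0.1, rng=None):
    """One random draw of an assumed MiSpaR-style setup, fitted by l2 interpolation.

    m: measurements, n: generative parameters, p: fitted parameters.
    Returns (training error, normalised generalization error).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = max(n, p)                       # number of feature columns to draw
    X = rng.standard_normal((m, d))     # assumption: i.i.d. N(0, 1) design

    # Sparse generative vector: k nonzeros among the first n coordinates.
    beta0 = np.zeros(d)
    k = max(1, int(round(rho * n)))
    support = rng.choice(n, size=k, replace=False)
    beta0[support] = rng.standard_normal(k)
    y = X[:, :n] @ beta0[:n] + sigma * rng.standard_normal(m)

    # Fit with the first p features only. lstsq gives the least-squares
    # solution for p < m and the minimum-l2-norm interpolant for p > m,
    # which stands in here for the lambda -> 0 limit of ridge regression.
    beta_hat = np.zeros(d)
    beta_hat[:p] = np.linalg.lstsq(X[:, :p], y, rcond=None)[0]

    te = np.mean((y - X[:, :p] @ beta_hat[:p]) ** 2)
    # With zero-mean, unit-variance i.i.d. features the excess test error
    # equals the squared parameter error; dividing by the signal power is
    # a normalisation chosen for this sketch.
    ge = np.sum((beta_hat - beta0) ** 2) / np.sum(beta0 ** 2)
    return te, ge


def sweep(m=50, n=100, ps=(25, 50, 100, 150, 400), trials=20, seed=0):
    """Average (TE, GE) over several draws for each number of fitted parameters."""
    rng = np.random.default_rng(seed)
    out = {}
    for p in ps:
        runs = np.array([mispar_l2(m, n, p, rng=rng) for _ in range(trials)])
        out[p] = runs.mean(axis=0)      # (mean TE, mean GE)
    return out


if __name__ == "__main__":
    m, n = 50, 100
    for p, (te, ge) in sweep(m=m, n=n).items():
        print(f"mu={p / m:5.2f}  mu*alpha={p / n:5.2f}  TE={te:.3e}  GE={ge:.3f}")
```

With these defaults $\alpha = 0.5$, so the interpolation point $\mu = 1$ (here $p = 50$) and the point $\mu\alpha = 1$ (here $p = 100$) are visibly separated in the sweep, which is the distinction the MiSpaR construction is meant to expose.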
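Experimental Results
Research Questions
- RQ1: How do misparametrization and sparsity affect training and generalization error when fitting data under $l_2$ and $l_1$ penalties?
- RQ2: Where does the overfitting peak occur relative to the data interpolation point ($\mu=1$) and to the boundary of the good-generalization regime (e.g. $\mu\alpha=1$)?
- RQ3: How do the $l_2$ and $l_1$ penalties differ in generalization when the model is highly overparametrized, noise is small, and sparsity is strong?
- RQ4: What are the exact analytical forms of GE and TE in the interpolation limit under both penalties, and how do they depend on $\alpha$, $\mu$, and $\rho$?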
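Key Findings
- In the interpolation limit ($\lambda\to 0$), the overfitting peak occurs at $\mu=1$ for both penalties, but good generalization can begin at $\mu\alpha=1$ rather than at the interpolation point.
- Under extreme overparametrization, generalization vanishes for both penalties ($GE(\mu\to\infty)=1$); however, when $\sigma^2$ and $\rho$ are small, the sparse $l_1$ penalty still generalizes well over a fairly wide range of $\mu$.
- In the regime of high overparametrization, small noise, and strong sparsity, there is a marked performance gap between $l_1$ and $l_2$: $l_1$ generalizes while $l_2$ fails.
- Finite regularization ($\lambda>0$) suppresses the overfitting peak, indicating that interpolation alone does not guarantee good generalization.
- The analytical GE expression for $l_1$ involves a set of three equations linking $\tau$, $\hat{\rho}$, and $\sigma_{\xi}$, and reveals algorithmic phase transitions in sparse regression.
- The work shows that generalization depends strongly on the inductive bias (the choice of penalty) and is not determined by data interpolation alone.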
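The gap between the two penalties in the sparse, low-noise, overparametrized regime can be illustrated with a small experiment. This is a hedged sketch under stated assumptions, not the paper's method: it uses an i.i.d. Gaussian design, approximates the $l_1$ interpolation limit by a lasso with a small but finite `lam`, and solves it with plain ISTA (a generic proximal-gradient solver swapped in for illustration). The names `lasso_ista` and `compare`, and all default values, are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=5000):
    """ISTA for (1 / 2m) * ||y - X b||^2 + lam * ||b||_1, started from zero."""
    m, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / m is the gradient's Lipschitz constant.
    step = m / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / m
        b = soft_threshold(b - step * grad, step * lam)
    return b


def compare(m=80, n=40, p=200, k=4, sigma=0.01, lam=1e-2, seed=0):
    """Normalised parameter error of small-lambda l1 vs minimum-norm l2.

    Here p > n, so the fitting model contains the generative one
    (mu * alpha = p / n > 1) and mu = p / m > 1 (overparametrized).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, p))      # assumption: i.i.d. N(0, 1) design
    beta0 = np.zeros(p)
    beta0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
    y = X @ beta0 + sigma * rng.standard_normal(m)

    b_l2 = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-l2-norm interpolant
    b_l1 = lasso_ista(X, y, lam)                  # near-interpolating l1 fit

    signal = np.sum(beta0 ** 2)
    ge_l1 = np.sum((b_l1 - beta0) ** 2) / signal
    ge_l2 = np.sum((b_l2 - beta0) ** 2) / signal
    return ge_l1, ge_l2


if __name__ == "__main__":
    ge_l1, ge_l2 = compare()
    print(f"normalised error  l1: {ge_l1:.4f}   l2: {ge_l2:.4f}")
```

The finite `lam` is a design choice for the sketch: it keeps ISTA well behaved, at the cost of a small shrinkage bias that a true $\lambda\to 0$ solution would not have.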
This summary was generated by AI and reviewed by a human editor.