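[Paper Explained] Triple descent and the two kinds of overfitting: Where & why do they appear?
This paper identifies and disentangles two distinct overfitting phenomena in neural networks: a linear peak at N=D, caused by fitting label noise as in linear regression, and a nonlinear peak at N=P, caused by the model's sensitivity to the weight initialization (and to the label noise). Using random feature models and neural networks, the study finds that both peaks coexist in noisy regression, that nonlinearity suppresses the linear peak while strengthening the nonlinear one, and that only the latter can be mitigated by regularization or ensembling.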
A recent line of research has highlighted the existence of a "double descent" phenomenon in deep learning, whereby increasing the number of training examples $N$ causes the generalization error of neural networks to peak when $N$ is of the same order as the number of parameters $P$. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when $N$ is equal to the input dimension $D$. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at $N\!=\!P$ is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in neural networks). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at $N\!=\!D$ is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep neural networks.
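The "degree of nonlinearity of the activation function" mentioned in the abstract can be made concrete with a small numerical sketch. The ratio r computed below, (E[σ'(z)])² / E[σ(z)²] for a standard Gaussian z, is an assumed way of measuring how close an activation σ is to linear; the abstract does not give a formula, and the function name `linearity_ratio` is hypothetical. The derivative is avoided by using Stein's lemma, E[σ'(z)] = E[z σ(z)], and the Gaussian expectations are approximated with Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def linearity_ratio(activation, n_nodes=100):
    """Assumed linearity measure r = (E[z * s(z)])**2 / E[s(z)**2], z ~ N(0, 1).

    By Stein's lemma E[s'(z)] = E[z * s(z)], so no derivative is needed.
    By the Cauchy-Schwarz inequality, r always lies in [0, 1].
    """
    # Nodes and weights for the weight function exp(-z**2 / 2).
    z, w = hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)  # normalize so that the weights sum to 1
    s = activation(z)
    eta = np.sum(w * s ** 2)       # E[s(z)^2]
    zeta = np.sum(w * z * s) ** 2  # (E[s'(z)])^2 via Stein's lemma
    return float(zeta / eta)


if __name__ == "__main__":
    activations = {
        "linear": lambda z: z,
        "relu": lambda z: np.maximum(z, 0.0),
        "tanh": np.tanh,
        "abs": np.abs,
    }
    for name, fn in activations.items():
        print(name, linearity_ratio(fn))
```

Under this assumed definition, a linear activation gives r = 1, ReLU gives r = 1/2, and an even function such as the absolute value gives r = 0, so a smaller r corresponds to a more strongly nonlinear activation.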
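Motivation and goals
- Distinguish two types of overfitting in neural networks: one tied to the input dimension D and one tied to the number of model parameters P.
- Investigate whether the two overfitting peaks, the linear peak (at N=D) and the nonlinear peak (at N=P), can coexist in the same model.
- Understand how the degree of nonlinearity of the activation function affects the prominence of each peak.
- Examine the effect of regularization and ensembling on each peak, and determine whether they affect the two kinds of overfitting equally.
- Analyze the temporal dynamics of peak formation during training, in particular the order in which the peaks appear.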
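Proposed method
- Analyze the test loss of random feature models with different activation functions to isolate the effect of nonlinearity on overfitting.
- Perform a bias-variance decomposition of the test loss, attributing the linear peak to noise fitting and the nonlinear peak to initialization variance.
- Use ridge regression in the random feature model to study analytically the eigenspectrum of the Gram matrix, in particular its small eigenvalues.
- Run numerical experiments on fully connected neural networks with ReLU, Tanh, and linear activation functions to validate the theoretical findings.
- Apply regularization (weight decay) and ensembling (averaging over multiple random seeds) to assess their different effects on the two peaks.
- Track the evolution of the test loss during training and compare when the two peaks form, relating this to the speed at which eigenmodes are learned.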
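The random-feature setup described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the linear teacher, the Gaussian inputs, the 1/sqrt(D) scaling, the default sizes, and the function name `random_feature_test_error` are all choices made here for illustration.

```python
import numpy as np


def random_feature_test_error(N, D=20, P=60, noise_std=0.5, ridge=1e-6,
                              activation=np.tanh, n_test=500, seed=0):
    """Test MSE of ridge regression on P random features of D-dim inputs.

    Assumed setup (illustrative only): Gaussian inputs, a linear teacher
    with additive label noise, and a fixed random first layer `Theta`.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D) / np.sqrt(D)   # teacher weights
    Theta = rng.standard_normal((D, P))          # random features, never trained
    X_train = rng.standard_normal((N, D))
    X_test = rng.standard_normal((n_test, D))
    y_train = X_train @ beta + noise_std * rng.standard_normal(N)
    y_test = X_test @ beta                       # noiseless test labels

    Z_train = activation(X_train @ Theta / np.sqrt(D))
    Z_test = activation(X_test @ Theta / np.sqrt(D))

    # Ridge solution for the second layer: a = (Z^T Z + ridge * I)^{-1} Z^T y
    gram = Z_train.T @ Z_train + ridge * np.eye(P)
    a = np.linalg.solve(gram, Z_train.T @ y_train)
    return float(np.mean((Z_test @ a - y_test) ** 2))


if __name__ == "__main__":
    # Sweep the number of training samples across N = D (20) and N = P (60).
    for n in (5, 10, 20, 30, 45, 60, 80, 120, 240):
        errs = [random_feature_test_error(n, seed=s) for s in range(10)]
        print(n, np.mean(errs))
```

Sweeping N at fixed D and P, and averaging over several seeds, is one way to look for the two peaks; changing `noise_std`, `ridge`, or `activation` then probes the effects discussed in the list. How clearly either peak shows up at these small sizes is not guaranteed.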
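Experimental results
Research questions
- RQ1: Are the linear peak at N=D and the nonlinear peak at N=P two independent overfitting phenomena?
- RQ2: Can the two peaks coexist in the same model? If so, under what conditions?
- RQ3: How does the degree of nonlinearity of the activation function affect the relative strength of each peak?
- RQ4: Do regularization and ensembling suppress both peaks equally, or only one of them?
- RQ5: Do the two peaks form at different times during training? If so, why?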
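Main findings
- The linear peak at N=D is caused entirely by overfitting the label noise and disappears in the noiseless case, confirming that it stems from behavior akin to linear regression.
- The nonlinear peak at N=P stems from the variance due to the initialization of the random features and persists even without label noise, indicating a fundamental sensitivity to weight initialization.
- Increasing the degree of nonlinearity (for example, going from a linear activation to ReLU or Tanh) weakens the linear peak through implicit regularization, while strengthening the nonlinear peak by increasing the initialization variance.
- Regularization and ensembling effectively suppress the nonlinear peak but have very little effect on the linear peak, which is already implicitly regularized by the nonlinearity.
- The nonlinear peak forms later in training than the linear peak, because it depends on learning the small eigenmodes of the Gram matrix, which converge more slowly.
- In the (P, N) plane the two peaks can coexist, producing a sample-wise triple descent curve that is most visible at high noise levels.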
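The ensembling finding above can be illustrated with a short sketch that averages the predictions of several random-feature models trained on the same data. Everything here (the linear teacher, the sizes, the Tanh activation, and the function name `single_vs_ensemble_error`) is an assumption made for illustration rather than the paper's experimental setup.

```python
import numpy as np


def single_vs_ensemble_error(N=40, D=10, P=40, K=8, noise_std=0.0,
                             ridge=1e-6, n_test=300, seed=0):
    """Compare the average single-model test MSE with the ensemble test MSE.

    K random-feature models share one training set and differ only in the
    random draw of their first layer, which plays the role of initialization.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(D) / np.sqrt(D)
    X_train = rng.standard_normal((N, D))
    X_test = rng.standard_normal((n_test, D))
    y_train = X_train @ beta + noise_std * rng.standard_normal(N)
    y_test = X_test @ beta

    preds = []
    for _ in range(K):
        Theta = rng.standard_normal((D, P))      # fresh "initialization"
        Z_train = np.tanh(X_train @ Theta / np.sqrt(D))
        Z_test = np.tanh(X_test @ Theta / np.sqrt(D))
        gram = Z_train.T @ Z_train + ridge * np.eye(P)
        a = np.linalg.solve(gram, Z_train.T @ y_train)
        preds.append(Z_test @ a)
    preds = np.array(preds)                      # shape (K, n_test)

    single = float(np.mean((preds - y_test) ** 2))               # mean over models
    ensemble = float(np.mean((preds.mean(axis=0) - y_test) ** 2))
    return single, ensemble


if __name__ == "__main__":
    single, ensemble = single_vs_ensemble_error()
    print("average single-model MSE:", single)
    print("ensemble MSE:", ensemble)
```

Because the squared error is convex, the ensemble error can never exceed the average single-model error; the gap between the two is the part of the error due to the variance across feature draws. With `noise_std=0.0` and N=P, the sketch isolates the initialization-driven contribution described above.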
This explainer was generated by AI and reviewed by a human editor.