[Paper Summary] Deep Exploration of Epoch-wise Double Descent in Noisy Data: Signal Separation, Large Activation, and Benign Overfitting
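This paper trains three multilayer perceptrons on CIFAR-10 with 30% label noise and studies epoch-wise double descent, revealing benign overfitting, the separation of clean and noisy data signals, and the emergence of large activations in the shallow layers. Through detailed analysis of internal signals, it links deep double descent, benign overfitting, and large activations.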
Deep double descent is one of the key phenomena underlying the generalization capability of deep learning models. In this study, epoch-wise double descent, which is delayed generalization following overfitting, was empirically investigated by focusing on the evolution of internal structures. Fully connected neural networks of three different sizes were trained on the CIFAR-10 dataset with 30% label noise. By decomposing the loss curves into signal contributions from clean and noisy training data, the epoch-wise evolutions of internal signals were analyzed separately. Three main findings were obtained from this analysis. First, the model achieved strong re-generalization on test data even after perfectly fitting noisy training data during the double descent phase, corresponding to a "benign overfitting" state. Second, noisy data were learned after clean data, and as learning progressed, their corresponding internal activations became increasingly separated in outer layers; this enabled the model to overfit only noisy data. Third, a single, very large activation emerged in the shallow layer across all models; this phenomenon is referred to as "outliers," "massive activations," and "super activations" in recent large language models and evolves with re-generalization. The magnitude of large activation correlated with input patterns but not with output patterns. These empirical findings directly link the recent key phenomena of "deep double descent," "benign overfitting," and "large activation," and support the proposal of a novel scenario for understanding deep double descent.
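Motivation and Objectives
- Study epoch-wise double descent under label noise during the training of simple feedforward networks.
- Analyze internal representations to understand how clean-data and noisy-data signals separate.
- Determine how hidden-layer activations evolve and how they contribute to generalization in the presence of noise.
- Explore the relationship between large activations in the shallow layers and re-generalization.
- Connect the observed phenomena to benign overfitting and signal compression dynamics.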
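Proposed Method
- Train fully connected networks (MLP7, MLP5, MLP3) on CIFAR-10 with 30% label noise, using the Adam optimizer and standard hyperparameters.
- Decompose training loss and accuracy into components measured on clean data and on noisy data (evaluating against both the noisy and the clean labels).
- In each hidden layer, compute the cosine similarity between the mean activation vectors of clean and noisy data to quantify signal separation at every epoch.
- Track how activation magnitude evolves over epochs to identify large activations in the shallow layers.
- Analyze how test-data signals relate to the clean and noisy training signals to infer distinct processing pathways for correctly and incorrectly predicted samples.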
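The first three steps above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names (`inject_label_noise`, `decompose_loss`, `clean_noisy_similarity`) are hypothetical, and it assumes that a corrupted label is always replaced by a different class and that each contribution is normalized by the full training-set size so the two parts sum to the total mean loss. The summary does not confirm either choice.

```python
import math
import random


def inject_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return (noisy_labels, is_noisy) with a fraction of labels corrupted."""
    rng = random.Random(seed)
    n = len(labels)
    noisy_idx = set(rng.sample(range(n), round(noise_rate * n)))
    noisy_labels, is_noisy = [], []
    for i, y in enumerate(labels):
        flipped = i in noisy_idx
        if flipped:
            # Assumption: the corrupted label is drawn from the *other* classes.
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy_labels.append(y)
        is_noisy.append(flipped)
    return noisy_labels, is_noisy


def cross_entropy(probs, label):
    """Cross-entropy of one sample given its predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def decompose_loss(probs_per_sample, train_labels, is_noisy):
    """Split the mean training loss into clean and noisy contributions.

    Both parts are divided by the full sample count, so clean + noisy
    equals the ordinary mean loss over the whole training set.
    """
    n = len(train_labels)
    clean = noisy = 0.0
    for probs, y, flipped in zip(probs_per_sample, train_labels, is_noisy):
        if flipped:
            noisy += cross_entropy(probs, y)
        else:
            clean += cross_entropy(probs, y)
    return clean / n, noisy / n


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clean_noisy_similarity(layer_activations, is_noisy):
    """Cosine similarity between mean clean and mean noisy activations of one layer."""
    clean = [a for a, flipped in zip(layer_activations, is_noisy) if not flipped]
    noisy = [a for a, flipped in zip(layer_activations, is_noisy) if flipped]
    return cosine_similarity(mean_vector(clean), mean_vector(noisy))
```

Calling `clean_noisy_similarity` once per hidden layer and per epoch yields a layer-by-epoch grid; under this sketch, values moving away from 1 would indicate growing separation between the clean and noisy signals.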
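Experimental Results
Research Questions
- RQ1: How does epoch-wise double descent manifest across different model sizes when training on noisy CIFAR-10 data?
- RQ2: Do internal representations separate clean-data and noisy-data signals during training, and how does this relate to generalization?
- RQ3: What role do large activations in the shallow layers play in re-generalization and benign overfitting?
- RQ4: How do signal separation and large activations relate to test performance on clean and noisy inputs?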
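Main Findings
- MLP7 exhibits epoch-wise double descent in test loss, whereas MLP5 and MLP3 do not.
- As training progresses, the internal signals of clean and noisy data become increasingly separable in the outer (deeper) layers.
- Around the onset of double descent, a large activation emerges in the shallow layers; it correlates with input patterns rather than labels and contributes to re-generalization.
- Despite fully fitting both the clean and the noisy training data, the model reaches a benign overfitting state in which test performance improves.
- Signal separation is stronger in larger models and is associated with successfully learning the noisy data without sacrificing generalization.
- Correctly predicted test data align closely with the clean training signals, while incorrectly predicted data align more with the noisy signals, suggesting distinct processing pathways.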
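One possible way to monitor the shallow-layer large activation across epochs is sketched below. The summary does not state how the paper measures activation magnitude, so the metric here (mean absolute activation per unit, compared with the layer median) and the names `largest_activation` and `track_over_epochs` are assumptions made only for illustration.

```python
from statistics import median


def largest_activation(layer_activations):
    """Return (unit_index, peak_magnitude, ratio_to_median) for one layer.

    layer_activations is a list of per-sample activation vectors.
    Assumption: "magnitude" means the mean absolute activation of a unit.
    """
    n = len(layer_activations)
    mean_abs = [sum(abs(x) for x in col) / n for col in zip(*layer_activations)]
    unit = max(range(len(mean_abs)), key=mean_abs.__getitem__)
    med = median(mean_abs)
    # A ratio far above 1 marks a single unit that dominates the layer.
    ratio = mean_abs[unit] / med if med > 0 else float("inf")
    return unit, mean_abs[unit], ratio


def track_over_epochs(activations_by_epoch):
    """Apply largest_activation to the same layer recorded at each epoch."""
    return [largest_activation(acts) for acts in activations_by_epoch]
```

Plotting the ratio from `track_over_epochs` next to the test loss curve would show whether the dominant unit grows during the re-generalization phase.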
This summary was generated by AI and reviewed by a human editor.