QUICK REVIEW

[Paper Digest] Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

Mingchen Li, Mahdi Soltanolkotabi|arXiv (Cornell University)|Mar 27, 2019

Machine Learning and Data Classification | References: 47 | Cited by: 155

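One-Sentence Summary

This paper proves that, under a clusterable data model, gradient descent with early stopping is robust to label noise for overparameterized one-hidden-layer networks: the early-stopped model stays close to its initialization and ignores the corrupted labels, because overfitting them would require moving far from the initialization.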

ABSTRACT

Modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data. Such neural networks in principle have the capacity to (over)fit any set of labels including pure noise. Despite this, somewhat paradoxically, neural network models trained via first-order methods continue to predict well on yet unseen test data. This paper takes a step towards demystifying this phenomenon. Under a rich dataset model, we show that gradient descent is provably robust to noise/corruption on a constant fraction of the labels despite overparameterization. In particular, we prove that: (i) In the first few iterations, where the updates are still in the vicinity of the initialization, gradient descent only fits the correct labels, essentially ignoring the noisy labels. (ii) To start to overfit to the noisy labels, the network must stray rather far from the initialization, which can only occur after many more iterations. Together, these results show that gradient descent with early stopping is provably robust to label noise and shed light on the empirical robustness of deep networks as well as commonly adopted heuristics to prevent overfitting.

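Research Motivation and Objectives

Motivation and analysis: explain why overparameterized neural networks trained with first-order methods still generalize well in the presence of label noise.
Establish a theoretical framework showing that gradient descent with early stopping is robust to corruption of a constant fraction of the labels.
Characterize how the distance from initialization governs the trade-off between robustness and overfitting.
Give conditions under which early stopping prevents overfitting and recovers the correct labels.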

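Proposed Method

Model: a one-hidden-layer neural network with k hidden units and fixed output weights, trained by gradient descent on the squared loss.
Data: a clusterable dataset with K clusters and at most K̄ ≤ K classes, in which a fraction ρ of the labels in each cluster is noisy/corrupted.
Key tool: the neural network covariance Σ(C), built from the cluster centers C and the derivative of the activation function; its minimum eigenvalue λ(C) indicates how separable the classes are.
Shows that gradient descent with step size η = constant × K/n × 1/||C||^2 reaches, after T iterations, a solution in a neighborhood of the initialization that predicts the true label for inputs close to a cluster.
Shows that the residual decomposes into a clean residual aligned with the subspace of large singular values and a noise residual lying in the subspace of small singular values, which yields robustness under early stopping.
Shows that overfitting the noisy labels requires moving far from the initialization, which ties robustness to the distance from initialization.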
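The setup described above can be simulated in a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the paper's experimental code: the softplus activation, the orthonormal cluster centers, the ±1 class labels, the constant 1 in the step size, and all function and parameter names (make_clustered_data, train_early_stopped, c_step, and so on) are illustrative choices that do not come from the digest.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)              # smooth activation (assumed choice)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))    # derivative of softplus, stable form

def make_clustered_data(K=4, d=10, n_per_cluster=20, eps=0.05, rho=0.2, seed=0):
    """Points near K unit-norm centers; a fraction rho of labels per cluster is flipped."""
    rng = np.random.default_rng(seed)
    centers = np.eye(K, d)                   # orthonormal centers (assumption)
    cluster_label = np.where(np.arange(K) % 2 == 0, 1.0, -1.0)
    n_flip = int(round(rho * n_per_cluster))
    X, y_true, y_noisy = [], [], []
    for c, lab in zip(centers, cluster_label):
        pts = c + eps * rng.standard_normal((n_per_cluster, d)) / np.sqrt(d)
        pts /= np.linalg.norm(pts, axis=1, keepdims=True)
        clean = np.full(n_per_cluster, lab)
        noisy = clean.copy()
        noisy[:n_flip] *= -1.0               # corrupt the first n_flip labels
        X.append(pts)
        y_true.append(clean)
        y_noisy.append(noisy)
    return np.vstack(X), np.concatenate(y_true), np.concatenate(y_noisy), centers

def train_early_stopped(X, y, centers, k=500, T=50, c_step=1.0, seed=1):
    """Gradient descent on 0.5 * ||f(X) - y||^2 over the hidden weights W only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    K = centers.shape[0]
    W0 = rng.standard_normal((k, d))
    v = np.where(np.arange(k) % 2 == 0, 1.0, -1.0) / np.sqrt(k)   # fixed output weights
    eta = c_step * K / (n * np.linalg.norm(centers, 2) ** 2)      # constant * K/n * 1/||C||^2
    W = W0.copy()
    dist = []
    for _ in range(T):                       # stopping after T steps is the early stopping
        Z = X @ W.T                          # pre-activations, shape (n, k)
        r = softplus(Z) @ v - y              # residual on the (noisy) training labels
        grad = ((r[:, None] * sigmoid(Z)) * v).T @ X              # dL/dW, shape (k, d)
        W -= eta * grad
        dist.append(np.linalg.norm(W - W0))  # Frobenius distance from initialization
    predict = lambda A: softplus(A @ W.T) @ v
    return predict, np.array(dist)

X, y_true, y_noisy, centers = make_clustered_data()
predict, dist = train_early_stopped(X, y_noisy, centers)
pred = np.sign(predict(X))
clean_acc = np.mean(pred == y_true)          # agreement with the uncorrupted labels
noisy_fit = np.mean(pred == y_noisy)         # agreement with the labels actually trained on
print(clean_acc, noisy_fit, dist[-1])
```

Comparing clean_acc with noisy_fit shows whether the early-stopped network follows the majority label of each cluster or the corrupted labels, and dist tracks how far the weights have moved from their initialization. Running far more iterations and watching both quantities is one way to explore the overfitting regime the paper describes.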

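Experimental Results

Research Questions

RQ1: Does gradient descent with early stopping provably learn the correct labels for overparameterized networks in the presence of label noise?
RQ2: How does the geometry of the data, through the cluster centers, the neural network covariance Σ(C), and its minimum eigenvalue λ(C), affect robustness to corrupted labels?
RQ3: What role does the distance from initialization play in preventing overfitting to noisy labels?
RQ4: How much label corruption can be tolerated while still predicting correctly on inputs close to the cluster centers?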

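Key Findings

Gradient descent with early stopping remains robust when a constant fraction of the labels is corrupted, and predicts the correct label for inputs close to the cluster centers.
The guarantee requires the final parameters to stay close to the initialization; moving too far is associated with overfitting to the noisy labels.
Under the stated conditions on the dataset and the network, robustness holds with high probability, including an upper bound on the corruption level, ρ ≤ δ/8, where δ is the separation between class labels in the paper's data model.
The number of iterations needed for robustness is moderate: it depends on the data geometry through λ(C) and ||C|| and is typically O(K), up to condition-number factors.
Under mild normalization conditions, robustness and final prediction accuracy do not depend on the network size; they depend on the cluster structure and the distance from initialization.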
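The quantities λ(C) and ||C|| that the findings depend on can be estimated numerically. The sketch below assumes the definition Σ(C) = (C Cᵀ) ⊙ E_w[φ'(Cw) φ'(Cw)ᵀ] with w drawn from a standard Gaussian; this formula is not stated in the digest and should be checked against the paper. The softplus activation, the random unit-norm centers, the sample size n = 80, and the constant 1 in the step size are likewise illustrative assumptions.

```python
import numpy as np

def neural_net_covariance(C, n_samples=20000, seed=0):
    """Monte Carlo estimate of (C C^T) * E_w[phi'(Cw) phi'(Cw)^T], w ~ N(0, I_d).

    The formula is an assumed reading of the paper's definition; phi is softplus,
    so phi' is the logistic sigmoid.
    """
    rng = np.random.default_rng(seed)
    K, d = C.shape
    w = rng.standard_normal((n_samples, d))
    D = 0.5 * (1.0 + np.tanh(0.5 * (w @ C.T)))   # phi'(Cw) per sample, shape (n_samples, K)
    second_moment = D.T @ D / n_samples           # estimate of E[phi'(Cw) phi'(Cw)^T]
    return (C @ C.T) * second_moment              # elementwise (Hadamard) product

rng = np.random.default_rng(1)
C = rng.standard_normal((4, 10))                  # K = 4 hypothetical cluster centers in R^10
C /= np.linalg.norm(C, axis=1, keepdims=True)     # unit-norm rows

Sigma = neural_net_covariance(C)
lam = np.linalg.eigvalsh(Sigma)[0]                # minimum eigenvalue, lambda(C)
spec = np.linalg.norm(C, 2)                       # spectral norm ||C||
n = 80                                            # hypothetical number of training points
eta = 1.0 * C.shape[0] / (n * spec ** 2)          # constant * K/n * 1/||C||^2 with constant = 1
ratio = spec ** 2 / lam                           # illustrative ratio only, not the paper's bound
print(lam, spec, eta, ratio)
```

A larger λ(C) relative to ||C||^2 corresponds to better-separated cluster centers in this sketch; the exact iteration bound and its constants are given in the paper, not here.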

This digest was generated by AI and reviewed by a human editor.