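[Paper Digest] Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian
The paper proposes a Jacobian-based, data-dependent theory showing that neural networks generalize by splitting the learning dynamics into an information space (fast, well aligned with the labels) and a nuisance space (slow, prone to overfitting), and proves that even constant-width networks can generalize on well-structured data.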
Modern neural network architectures often generalize well despite containing many more parameters than the size of the training dataset. This paper explores the generalization capabilities of neural networks trained via gradient descent. We develop a data-dependent optimization and generalization theory which leverages the low-rank structure of the Jacobian matrix associated with the network. Our results help demystify why training and generalization are easier on clean and structured datasets and harder on noisy and unstructured datasets, as well as how the network size affects the evolution of the train and test errors during training. Specifically, we use a control knob to split the Jacobian spectrum into "information" and "nuisance" spaces associated with the large and small singular values. We show that over the information space learning is fast and one can quickly train a model with zero training loss that can also generalize well. Over the nuisance space training is slower and early stopping can help with generalization at the expense of some bias. We also show that the overall generalization capability of the network is controlled by how well the label vector is aligned with the information space. A key feature of our results is that even constant width neural nets can provably generalize for sufficiently nice datasets. We conduct various numerical experiments on deep networks that corroborate our theoretical findings and demonstrate that: (i) the Jacobian of typical neural networks exhibits low-rank structure with a few large singular values and many small ones, leading to a low-dimensional information space, (ii) over the information space learning is fast and most of the label vector falls on this space, and (iii) label noise falls on the nuisance space and impedes optimization/generalization.
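Research Motivation and Objectives
- Motivate and quantify how neural networks trained by gradient descent generalize in the overparameterized regime.
- Introduce a data-dependent decomposition of the learning dynamics, via the Jacobian spectrum, into an information space and a nuisance space.
- Show how alignment of the labels with the information space, together with a low-rank Jacobian, yields strong generalization at limited width.
- Analyze the bias-variance trade-off and the effect of network size on training and test performance.
- Bring arbitrary initializations (including pretrained models) into the generalization framework.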
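Proposed Method
- Use the singular value decomposition of the network's Jacobian to define the information and nuisance spaces.
- Decompose the training dynamics and the generalization error into contributions from the information space and the nuisance space.
- Use a bias-variance framework in which the bias comes from misalignment between the labels and the information space, and the variance comes from how far the model moves away from its initialization.
- Provide finite-sample, data-dependent guarantees for both random and arbitrary initialization via the Multiclass Neural Tangent Kernel (M-NTK) (Theorems 3.2 and 3.3).
- Under low-rank Jacobian structure, the width can be moderate (for example, logarithmic in the dataset size) while still achieving good generalization.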
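The spectrum split described in the list above can be written out as a short worked definition. This is a sketch in notation chosen here for illustration, assuming a Jacobian $J \in \mathbb{R}^{n \times p}$ with $n \le p$ and a cutoff $\alpha$ playing the role of the abstract's "control knob"; the paper's exact definitions and normalizations may differ.

```latex
\begin{align}
  J &= \sum_{i=1}^{n} \sigma_i \, u_i v_i^\top,
      \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0, \\
  \mathcal{I} &= \operatorname{span}\{\, u_i : \sigma_i \ge \alpha \,\},
      \qquad \mathcal{N} = \mathcal{I}^{\perp}, \\
  y &= \Pi_{\mathcal{I}}(y) + \Pi_{\mathcal{N}}(y),
      \qquad \|y\|_2^2 = \|\Pi_{\mathcal{I}}(y)\|_2^2 + \|\Pi_{\mathcal{N}}(y)\|_2^2 .
\end{align}
```

Here $\mathcal{I}$ is the information space, $\mathcal{N}$ is the nuisance space, and $\Pi$ denotes orthogonal projection. In this notation, "good alignment" of the label vector $y$ means that $\|\Pi_{\mathcal{N}}(y)\|_2$ is small relative to $\|y\|_2$.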
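Experimental Results
Research Questions
- RQ1: Can gradient descent generalize on overparameterized networks by exploiting low-rank Jacobian structure?
- RQ2: How does alignment of the labels with the Jacobian's information space affect generalization performance?
- RQ3: When the Jacobian is effectively low-rank, what role does network width play in generalization?
- RQ4: Under the Jacobian-based analysis, do pretrained or arbitrarily initialized models enjoy similar generalization guarantees?
- RQ5: How do the bias and variance components separate in terms of the information and nuisance spaces?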
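Key Findings
- The Jacobian of typical neural networks exhibits low-rank structure, with a few large singular values and many small ones, which defines a low-dimensional information space.
- Learning over the information space is fast, and most of the label vector lies in this space, so the training error drops quickly.
- Learning over the nuisance space is slower; early stopping helps generalization there, at the cost of some bias.
- Generalization improves when the label vector is well aligned with the information space; for sufficiently well-structured data, the width can be constant or moderate.
- The framework gives data-dependent guarantees that do not require extremely wide networks, and the results extend to arbitrary initializations, including pretrained models.
- Numerical experiments support the theoretical claims, showing fast convergence along information directions and slower, bias-prone learning along nuisance directions.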
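The fast/slow behavior in the findings above can be illustrated with a toy calculation. The sketch below is not the paper's experiment: it assumes a linearized least-squares model, for which gradient descent with step size `lr` shrinks the residual along the i-th singular direction by a factor `(1 - lr * sigma_i**2)` per step. The function names, singular values, label coefficients, and cutoff are all hypothetical values chosen for illustration.

```python
def split_spectrum(singular_values, alpha):
    """Split indices by the cutoff alpha: sigma >= alpha is 'information',
    sigma < alpha is 'nuisance'."""
    info = [i for i, s in enumerate(singular_values) if s >= alpha]
    nuis = [i for i, s in enumerate(singular_values) if s < alpha]
    return info, nuis


def residual_after(singular_values, coeffs, lr, steps):
    """Residual coefficients (in the singular basis) after `steps` steps of
    gradient descent on a linearized least-squares problem (an assumption)."""
    return [c * (1.0 - lr * s * s) ** steps
            for s, c in zip(singular_values, coeffs)]


def energy(coeffs, idx):
    """Squared norm of the coefficients restricted to the given indices."""
    return sum(coeffs[i] ** 2 for i in idx)


# Hypothetical spectrum: two large and two small singular values.
sigmas = [3.0, 2.0, 0.1, 0.05]
# Hypothetical label coefficients: most of the energy sits on the top directions.
label_coeffs = [0.8, 0.55, 0.2, 0.1]

info_idx, nuis_idx = split_spectrum(sigmas, alpha=1.0)
residual = residual_after(sigmas, label_coeffs, lr=0.1, steps=20)

print("information-space residual energy:", energy(residual, info_idx))
print("nuisance-space residual energy:   ", energy(residual, nuis_idx))
```

With these made-up numbers, the residual on the information space is essentially gone after 20 steps, while the residual on the nuisance space has barely moved. Stopping early therefore leaves the nuisance component unfit, which is the source of the bias mentioned above.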
This digest was generated by AI and reviewed by human editors.