QUICK REVIEW

[Paper Review] Uniform convergence may be unable to explain generalization in deep learning

Vaishnavh Nagarajan, J. Zico Kolter | arXiv (Cornell University) | Feb 13, 2019

Stochastic Gradient Optimization Techniques | References: 39 | Cited by: 42

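ONE-SENTENCE SUMMARY

The paper argues that generalization bounds based on uniform convergence can be vacuous for overparameterized models trained by gradient descent, and it demonstrates this failure both empirically and theoretically, even in the algorithm-dependent setting.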

ABSTRACT

Aimed at explaining the surprisingly good generalization behavior of overparameterized deep networks, recent works have developed a variety of generalization bounds for deep learning, all based on the fundamental learning-theoretic technique of uniform convergence. While it is well-known that many of these existing bounds are numerically large, through numerous experiments, we bring to light a more concerning aspect of these bounds: in practice, these bounds can increase with the training dataset size. Guided by our observations, we then present examples of overparameterized linear classifiers and neural networks trained by gradient descent (GD) where uniform convergence provably cannot "explain generalization" -- even if we take into account the implicit bias of GD to the fullest extent possible. More precisely, even if we consider only the set of classifiers output by GD, which have test errors less than some small ε in our settings, we show that applying (two-sided) uniform convergence on this set of classifiers will yield only a vacuous generalization guarantee larger than 1-ε. Through these findings, we cast doubt on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

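MOTIVATION AND OBJECTIVES

Explain why the generalization of overparameterized deep networks goes beyond what classical uniform convergence can account for.
Show empirically that common uniform-convergence-based bounds increase with training set size instead of decreasing as expected.
Provide theoretical constructions in which two-sided uniform convergence bounds cannot explain generalization, even when the implicit bias of gradient descent is taken into account.
Highlight the fundamental limitations of uniform convergence as a tool for understanding generalization in deep learning.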

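PROPOSED METHOD

Empirically analyze weight norms and generalization for fully connected networks (depth 5, width 1024) trained on MNIST with SGD (learning rate 0.1, batch size 1) until 99% of the training data is classified with a margin of at least γ* = 10.
Measure how the distance from initialization and the product of spectral norms grow as the training set size m increases (at least as m^0.4 and m, respectively).
Evaluate existing generalization bounds from prior work and show that, because their numerator terms grow with m, the bounds grow as Ω(m^0.68).
Give theoretical constructions, for high-dimensional linear classifiers and for neural networks trained by gradient descent, in which two-sided uniform convergence bounds are provably vacuous.
Define the tightest algorithm-dependent uniform convergence bound, taken only over the hypotheses the algorithm actually explores, and use it to argue that uniform convergence is limited in explaining generalization.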
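The two norm-based quantities tracked in the empirical analysis, and the growth exponent fitted against m, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the weight matrices are random placeholders, and the m^0.4 series is synthetic data generated only to show that a log-log fit recovers an exponent.

```python
import numpy as np

def distance_from_init(weights, init_weights):
    """L2 distance between trained and initial weights, over all layers."""
    sq = sum(np.sum((w - w0) ** 2) for w, w0 in zip(weights, init_weights))
    return float(np.sqrt(sq))

def spectral_norm_product(weights):
    """Product over layers of the largest singular value of each matrix."""
    return float(np.prod([np.linalg.norm(w, ord=2) for w in weights]))

def fit_growth_exponent(ms, values):
    """Slope of log(values) against log(ms), i.e. p in values ~ c * m^p."""
    slope, _ = np.polyfit(np.log(ms), np.log(values), deg=1)
    return float(slope)

# Placeholder "network": three 4x4 layers, each moved by 0.5 * I from init.
rng = np.random.default_rng(0)
init = [rng.normal(size=(4, 4)) for _ in range(3)]
trained = [w0 + 0.5 * np.eye(4) for w0 in init]
dist = distance_from_init(trained, init)

# Spectral norm of 2 * I is 2, so three such layers give a product of 8.
prod = spectral_norm_product([2.0 * np.eye(4)] * 3)

# Synthetic series that grows exactly as m^0.4 (not measured data).
ms = np.array([128, 512, 2048, 8192])
synthetic = 3.0 * ms ** 0.4
exponent = fit_growth_exponent(ms, synthetic)

print(dist, prod, exponent)
```

In a real experiment, `weights` and `init_weights` would come from networks trained at each dataset size m, and the fitted exponent would be compared against the m^0.4 and m growth rates reported above.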

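EXPERIMENTAL RESULTS

Research Questions

RQ1: Can uniform convergence bounds provide non-vacuous generalization guarantees for overparameterized models trained by gradient descent?
RQ2: In practice, do the weight-norm-based quantities used in many bounds decrease as the training set grows, in line with the observed generalization performance?
RQ3: In realistic deep learning settings, is the tightest algorithm-dependent uniform convergence bound still vacuous?
RQ4: What fundamental limitations does uniform convergence have in capturing the generalization behavior of overparameterized neural networks?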
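The "tightest algorithm-dependent uniform convergence bound" in RQ3 can be written down roughly as follows. This is a sketch of the definition under assumed notation: $h_S$ is the hypothesis the algorithm outputs on sample $S$, $L_{\mathcal{D}}$ is the test error, $\hat{L}_S$ is the empirical error on $S$, and $\mathcal{S}_\delta$ is a set of "typical" samples. These symbols are chosen here for illustration and may differ from the paper's.

```latex
\epsilon_{\text{unif-alg}}(m,\delta)
  \;=\; \inf\Bigl\{\, \epsilon \;:\;
     \exists\, \mathcal{S}_\delta \ \text{with}\
     \Pr_{S \sim \mathcal{D}^m}\bigl[S \in \mathcal{S}_\delta\bigr] \ge 1-\delta
     \ \text{and}\
     \sup_{S \in \mathcal{S}_\delta}\ \sup_{h \in \mathcal{H}_\delta}
       \bigl| L_{\mathcal{D}}(h) - \hat{L}_S(h) \bigr| \le \epsilon
  \,\Bigr\},
\qquad
\mathcal{H}_\delta \;=\; \bigcup_{S \in \mathcal{S}_\delta} \{ h_S \}.
```

The inner supremum ranges only over hypotheses the algorithm actually outputs on typical samples, so any bound that applies uniform convergence to a larger hypothesis class can only be looser.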

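Main Findings

Weight norms such as the distance from initialization and the product of spectral norms grow polynomially with the training set size m (at least as m^0.4 and m, respectively).
Test error decreases with m (roughly as 1/m^0.43 in the setting studied), yet the numerator terms of the corresponding bounds grow with m, so the bounds themselves increase (Ω(m^0.68)).
Even after pruning the hypothesis class down to the smallest set explored by the algorithm (the tightest uniform convergence bound), the resulting guarantee is nearly vacuous (for small ε, the bound approaches 1).
Two-sided uniform convergence bounds fail to explain generalization for overparameterized linear classifiers and for neural networks trained by GD/SGD, even when implicit regularization is accounted for.
Although conceptually different, one-sided PAC-Bayes bounds also degrade to nearly vacuous guarantees in these settings.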
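The failure of two-sided uniform convergence for an overparameterized linear classifier can be illustrated with a small simulation. The sketch below is a toy setup loosely modeled on the paper's high-dimensional linear example. The dimensions, the noise variance 32/D, the choice ||u||^2 = 1/m, the single gradient step from zero, and the helper names `sample` and `error` are assumptions made for illustration, not details confirmed by this review.

```python
import numpy as np

def sample(m, D, u, rng):
    """Draw m points: a small signal part 2*y*u plus D-dim Gaussian noise."""
    y = rng.choice([-1.0, 1.0], size=m)
    x1 = 2.0 * y[:, None] * u[None, :]
    x2 = rng.normal(0.0, np.sqrt(32.0 / D), size=(m, D))
    return x1, x2, y

def error(w1, w2, x1, x2, y):
    """Fraction of points with non-positive margin under w = (w1, w2)."""
    scores = x1 @ w1 + x2 @ w2
    return float(np.mean(y * scores <= 0))

rng = np.random.default_rng(0)
m, K, D = 10, 5, 10_000             # D >> m: heavily overparameterized
u = np.ones(K) / np.sqrt(K * m)     # assumed scaling: ||u||^2 = 1/m

# "Training": one gradient step from zero, which gives w = sum_i y_i * x_i.
x1, x2, y = sample(m, D, u, rng)
w1 = (y[:, None] * x1).sum(axis=0)
w2 = (y[:, None] * x2).sum(axis=0)

train_err = error(w1, w2, x1, x2, y)

# Fresh test data: the memorized noise is nearly orthogonal to new noise.
tx1, tx2, ty = sample(500, D, u, rng)
test_err = error(w1, w2, tx1, tx2, ty)

# Noise-flipped copy S' of the training set. It is distributed like a
# fresh sample, yet the learned classifier gets it almost entirely wrong.
flipped_err = error(w1, w2, x1, -x2, y)

print(train_err, test_err, flipped_err)
```

The classifier fits the training set and generalizes to fresh data, but the flipped set S' is as likely as S under the data distribution and is misclassified. A bound that must hold uniformly over typical samples therefore sees a gap between test error and empirical error of nearly 1, which is the sense in which the guarantee is vacuous.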

This review was generated by AI and checked by a human editor.