QUICK REVIEW

[Paper Review] To understand deep learning we need to understand kernel learning

Mikhail Belkin, Siyuan Ma|arXiv (Cornell University)|Feb 5, 2018

Face and Expression Recognition | References: 28 | Cited by: 110

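One-Sentence Summary

The paper shows that kernel methods trained to overfit or interpolate the training data generalize well on both real and synthetic datasets and resemble deep networks in this respect, while existing generalization bounds cannot explain this behavior.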

ABSTRACT

Generalization performance of classifiers in deep learning has recently become a subject of intense study. Deep models, typically over-parametrized, tend to fit the training data exactly. Despite this "overfitting", they perform well on test data, a phenomenon not yet fully understood. The first point of our paper is that strong performance of overfitted classifiers is not a unique feature of deep learning. Using six real-world and two synthetic datasets, we establish experimentally that kernel machines trained to have zero classification or near zero regression error perform very well on test data, even when the labels are corrupted with a high level of noise. We proceed to give a lower bound on the norm of zero loss solutions for smooth kernels, showing that they increase nearly exponentially with data size. We point out that this is difficult to reconcile with the existing generalization bounds. Moreover, none of the bounds produce non-trivial results for interpolating solutions. Second, we show experimentally that (non-smooth) Laplacian kernels easily fit random labels, a finding that parallels results for ReLU neural networks. In contrast, fitting noisy data requires many more epochs for smooth Gaussian kernels. Similar performance of overfitted Laplacian and Gaussian classifiers on test, suggests that generalization is tied to the properties of the kernel function rather than the optimization process. Certain key phenomena of deep learning are manifested similarly in kernel methods in the modern "overfitted" regime. The combination of the experimental and theoretical results presented in this paper indicates a need for new theoretical ideas for understanding properties of classical kernel methods. We argue that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood.

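Research Motivation and Goals

Demonstrate that overfitted/interpolated kernel classifiers also generalize well across diverse datasets.
Show that a non-smooth (Laplacian) kernel fits random labels easily, whereas a Gaussian kernel is harder to fit yet reaches similar test performance.
Give a theoretical lower bound showing that, under nonzero label noise, the RKHS norm of interpolating solutions grows rapidly with data size.
Argue that current generalization bounds for kernel methods cannot describe the behavior of interpolating kernels, so new theory is needed.
Emphasize that generalization is tied to the structure of the kernel rather than to the dynamics of the optimization process.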

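Proposed Method

Use kernel machines in a reproducing kernel Hilbert space (RKHS), which generalize linear regression to an infinite-dimensional space, with Gaussian and Laplacian kernels.
Construct the interpolating solution via the Representer Theorem by solving for alpha such that K alpha = y (Eq. 2).
Compare overfitted (zero classification error) and interpolated (zero regression loss) solutions across several datasets.
Use EigenPro-SGD as an accelerated kernel learning method to reach zero classification error.
Derive a theoretical lower bound showing that, under t-overfitting with nonzero label noise, the RKHS norm must grow nearly exponentially with data size.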
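The interpolation step (solving K alpha = y and evaluating the resulting function) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's pipeline: it uses a direct linear solve instead of EigenPro-SGD, and the synthetic data, the sample sizes, the shared `bandwidth` value, and all function names are assumptions made for the example.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def gaussian_kernel(A, B, bandwidth):
    """Smooth kernel: exp(-||a - b||^2 / (2 * bandwidth^2))."""
    return np.exp(-pairwise_distances(A, B) ** 2 / (2.0 * bandwidth ** 2))


def laplacian_kernel(A, B, bandwidth):
    """Non-smooth kernel: exp(-||a - b|| / bandwidth)."""
    return np.exp(-pairwise_distances(A, B) / bandwidth)


def fit_interpolant(kernel, X, y, bandwidth):
    """Solve K alpha = y; also return the RKHS norm sqrt(alpha^T K alpha)."""
    K = kernel(X, X, bandwidth)
    alpha = np.linalg.solve(K, y)
    return alpha, float(np.sqrt(alpha @ K @ alpha))


def predict(kernel, X_train, alpha, X_new, bandwidth):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x) at the rows of X_new."""
    return kernel(X_new, X_train, bandwidth) @ alpha


# Hypothetical synthetic task: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 10))
X_test = rng.standard_normal((500, 10))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
y_test = np.where(X_test[:, 0] > 0, 1.0, -1.0)
bandwidth = 2.0  # arbitrary illustrative value, not taken from the paper

results = {}
for name, kernel in [("gaussian", gaussian_kernel), ("laplacian", laplacian_kernel)]:
    alpha, rkhs_norm = fit_interpolant(kernel, X_train, y_train, bandwidth)
    train_fit = predict(kernel, X_train, alpha, X_train, bandwidth)
    test_pred = np.sign(predict(kernel, X_train, alpha, X_test, bandwidth))
    results[name] = {
        "max_train_residual": float(np.max(np.abs(train_fit - y_train))),
        "rkhs_norm": rkhs_norm,
        "test_error": float(np.mean(test_pred != y_test)),
    }
    print(name, results[name])
```

A direct solve is only practical for small n; for the dataset sizes in the paper an iterative method such as the EigenPro-SGD mentioned above would be the natural replacement.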

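Experimental Results

Research Questions

RQ1: Do overfitted/interpolated kernel methods generalize well on real-world and synthetic datasets?
RQ2: How do smooth (Gaussian) and non-smooth (Laplacian) kernels differ in fitting noisy or random labels, and in test performance?
RQ3: Why do existing generalization bounds fail to explain the performance of interpolated kernel classifiers, and what should a theory that describes it better look like?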

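Key Findings

Interpolated kernel classifiers achieve near-optimal test performance on six real-world and two synthetic datasets, even under high label noise.
For these interpolated classifiers, regularization by early stopping improves test performance only slightly, if at all.
The non-smooth Laplacian kernel easily fits random labels, mirroring observations for ReLU networks; the smooth Gaussian kernel needs more training epochs to fit noisy data.
For smooth kernels, the RKHS norm of overfitted solutions grows nearly exponentially with data size, which is hard to reconcile with existing bounds that typically depend polynomially on the norm.
Despite added label noise, the empirical test performance of interpolated kernel classifiers remains stable, and Laplacian and Gaussian kernels perform similarly.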
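The label-noise findings above can be probed on a toy problem with the sketch below. It flips a fraction of the training labels, interpolates the noisy labels exactly with each kernel, and records the training error on the noisy labels, the test error on clean labels, and the RKHS norm. Everything here is an assumption made for illustration (the synthetic task, the noise levels, the `bandwidth`, the helper names), and a direct solver is used, so it cannot reproduce the epoch counts the paper reports for SGD-based training.

```python
import numpy as np


def kernel_matrix(A, B, bandwidth, smooth):
    """Gaussian kernel if smooth is True, Laplacian kernel otherwise."""
    dist = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))
    if smooth:
        return np.exp(-dist ** 2 / (2.0 * bandwidth ** 2))
    return np.exp(-dist / bandwidth)


def flip_labels(y, fraction, rng):
    """Return a copy of y with the given fraction of +/-1 labels flipped."""
    y_noisy = y.copy()
    n_flip = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] *= -1.0
    return y_noisy


# Hypothetical synthetic task: the clean label is the sign of the first coordinate.
rng = np.random.default_rng(1)
n_train, n_test, dim, bandwidth = 200, 500, 10, 2.0
X_train = rng.standard_normal((n_train, dim))
X_test = rng.standard_normal((n_test, dim))
y_train = np.where(X_train[:, 0] > 0, 1.0, -1.0)
y_test = np.where(X_test[:, 0] > 0, 1.0, -1.0)

records = []
for noise in (0.0, 0.1, 0.2):
    y_noisy = flip_labels(y_train, noise, rng)
    for name, smooth in (("gaussian", True), ("laplacian", False)):
        K = kernel_matrix(X_train, X_train, bandwidth, smooth)
        alpha = np.linalg.solve(K, y_noisy)  # exact interpolation of noisy labels
        train_error = float(np.mean(np.sign(K @ alpha) != y_noisy))
        K_test = kernel_matrix(X_test, X_train, bandwidth, smooth)
        test_error = float(np.mean(np.sign(K_test @ alpha) != y_test))
        rkhs_norm = float(np.sqrt(alpha @ K @ alpha))
        records.append((noise, name, train_error, test_error, rkhs_norm))
        print(f"noise={noise:.1f} kernel={name:9s} train_err={train_error:.3f} "
              f"test_err={test_error:.3f} rkhs_norm={rkhs_norm:.2f}")
```

Comparing the printed RKHS norms and test errors across noise levels gives a rough feel for the quantities the paper studies; the numbers from this toy setup should not be read as a replication of its results.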

This review was generated by AI and checked by a human editor.