QUICK REVIEW

[Paper Digest] Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee | arXiv (Cornell University) | Oct 12, 2018

Stochastic Gradient Optimization Techniques | References: 78 | Cited by: 46

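One-Sentence Summary

This paper shows that explicit L2 regularization lets neural networks generalize better, learning a constructed distribution from only $O(d)$ samples where the NTK-based kernel method needs $\Omega(d^2)$ samples; it also proves that, with regularization, optimization in the infinite-width limit converges in polynomial time.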

ABSTRACT

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK). This analysis leads to global convergence results but does not work when there is a standard $\ell_2$ regularizer, which is useful to have in practice. We show that sample efficiency can indeed depend on the presence of the regularizer: we construct a simple distribution in $d$ dimensions which the optimal regularized neural net learns with $O(d)$ samples but the NTK requires $\Omega(d^2)$ samples to learn. To prove this, we establish two analysis tools: i) for multi-layer feedforward ReLU nets, we show that the global minimizer of a weakly-regularized cross-entropy loss is the max normalized margin solution among all neural nets, which generalizes well; ii) we develop a new technique for proving lower bounds for kernel methods, which relies on showing that the kernel cannot focus on informative features. Motivated by our generalization results, we study whether the regularized global optimum is attainable. We prove that for infinite-width two-layer nets, noisy gradient descent optimizes the regularized neural net loss to a global minimum in polynomial iterations.

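Research Motivation and Goals

Explain why over-parameterization and explicit regularization affect generalization, going beyond what the NTK analysis captures.
Exhibit a concrete data distribution on which the regularized network succeeds with $O(d)$ samples, whereas the NTK requires $\Omega(d^2)$ samples.
Develop theoretical tools that connect weak regularization to the max-margin solution, and prove margin-based generalization bounds.
Prove that, via a perturbed Wasserstein gradient flow, an infinite-width regularized network can be optimized to the global optimum in polynomial time.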
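
To make the link between weak regularization and the max-margin solution concrete, the two objects involved can be written out as below. This is a sketch in assumed notation, not a verbatim statement from the paper: $f(\Theta; x)$ denotes the network output, $\ell$ the logistic loss, $\lambda$ the regularization strength, and $(x_i, y_i)$ the $n$ training examples with $y_i \in \{-1, +1\}$.

$$
L_\lambda(\Theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i f(\Theta; x_i)\big) + \lambda \|\Theta\|_2^2,
\qquad
\gamma^\star = \max_{\|\Theta\|_2 \le 1} \; \min_{i} \; y_i f(\Theta; x_i).
$$

Under this notation, the weak-regularization claim reads: as $\lambda \to 0$, the normalized margin $\min_i y_i f(\Theta_\lambda / \|\Theta_\lambda\|_2; x_i)$ of a global minimizer $\Theta_\lambda$ of $L_\lambda$ approaches $\gamma^\star$.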

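Proposed Method

Construct a distribution D in $d$ dimensions whose signal is concentrated in the first two coordinates.
Compare the NTK induced by the architecture with a two-layer ReLU network trained on the logistic loss with L2 regularization.
Prove that, under weak regularization, the regularized network converges to the max-margin solution and generalizes well.
Introduce a perturbed Wasserstein gradient flow and prove polynomial-time convergence to the global minimum for infinite-width networks.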
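
A small numerical sketch can illustrate what "signal concentrated in the first two coordinates" might look like. It assumes, for illustration only, that inputs are uniform on $\{-1, +1\}^d$ and that the label is the product of the first two coordinates; the digest itself does not state the exact labeling rule. The function names are hypothetical, and the hand-built width-4 network is just one network that separates the data, not necessarily the max-margin one.

```python
import numpy as np

def sample_distribution(n, d, rng):
    """Draw n points uniformly from {-1, +1}^d.
    ASSUMPTION: the label is x_1 * x_2, so only two coordinates carry signal."""
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X[:, 0] * X[:, 1]
    return X, y

def build_two_coordinate_net(d):
    """Hand-built width-4 two-layer ReLU net that reads only coordinates 1 and 2.
    Its output is |x_1 + x_2| - |x_1 - x_2|."""
    W = np.zeros((4, d))
    W[:, :2] = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
    a = np.array([1.0, 1.0, -1.0, -1.0])
    return W, a

def forward(W, a, X):
    """Two-layer ReLU net: f(x) = a^T relu(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def normalized_margin(W, a, X, y):
    """min_i y_i f(x_i) / ||theta||^2.
    A two-layer ReLU net is 2-homogeneous in its parameters, so dividing by
    the squared norm equals evaluating the net at theta / ||theta||."""
    sq_norm = np.sum(W ** 2) + np.sum(a ** 2)
    return np.min(y * forward(W, a, X)) / sq_norm

rng = np.random.default_rng(0)
X, y = sample_distribution(200, 20, rng)
W, a = build_two_coordinate_net(20)
margins = y * forward(W, a, X)          # unnormalized margins y_i f(x_i)
gamma = normalized_margin(W, a, X, y)   # normalized margin of this network
```

Because the network ignores every coordinate except the first two, its normalized margin does not shrink as $d$ grows, which is the kind of "focus on informative features" that the lower-bound argument says a kernel cannot achieve.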

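Experimental Results

Research Questions

RQ1: Does explicit L2 regularization give neural networks a larger margin and better generalization than the NTK?
RQ2: On the constructed data distribution, how large is the sample-complexity gap between regularized neural networks and NTK-based methods?
RQ3: In the infinite-width limit, can the regularized global optimum be reached by efficient optimization?
RQ4: Does weak regularization push the optimizer toward the max-margin solution in deep architectures?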

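Key Findings

On the constructed distribution, the regularized neural network generalizes well with $O(d)$ samples, whereas the NTK requires $\Omega(d^2)$ samples.
The global minimizer of the weakly regularized logistic loss attains the maximum normalized margin among all networks of the same architecture.
Over-parameterization in width helps: the maximum attainable margin is non-decreasing in network width, which improves the generalization bound.
For infinite-width two-layer networks, noisy gradient descent optimizes the regularized loss to a global minimum in polynomial time.
Simulations indicate that explicit regularization improves both margin and test accuracy relative to unregularized networks.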
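
The optimization finding can be illustrated with a finite-width stand-in. The sketch below runs gradient descent with an L2 penalty and small Gaussian perturbations on a two-layer ReLU network. It rests on several assumptions: the paper's result concerns infinite-width networks and a perturbed Wasserstein gradient flow, whereas this uses a width-50 network and plain Gaussian noise; the data follow the illustrative rule that the label is the product of the first two coordinates; all names and hyperparameters are hypothetical.

```python
import numpy as np

def loss_and_grads(W, a, X, y, lam):
    """Logistic loss plus L2 penalty for f(x) = a^T relu(W x), with gradients."""
    n = X.shape[0]
    pre = X @ W.T                      # (n, m) pre-activations
    hid = np.maximum(pre, 0.0)         # (n, m) ReLU features
    z = y * (hid @ a)                  # (n,) margins y_i f(x_i)
    loss = np.mean(np.logaddexp(0.0, -z)) + lam * (np.sum(W**2) + np.sum(a**2))
    # d/df of log(1 + exp(-y f)) is -y / (1 + exp(y f)), computed stably.
    g = -y * np.exp(-np.logaddexp(0.0, z)) / n
    grad_a = hid.T @ g + 2.0 * lam * a
    grad_W = (g[:, None] * (pre > 0) * a[None, :]).T @ X + 2.0 * lam * W
    return loss, grad_W, grad_a

def noisy_gd(X, y, width=50, lam=1e-3, lr=0.05, noise_std=1e-3, steps=200, seed=0):
    """Gradient descent with weight decay and Gaussian perturbations.
    ASSUMPTION: Gaussian noise is a simple stand-in for the paper's perturbation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)
    history = []
    for _ in range(steps):
        loss, gW, ga = loss_and_grads(W, a, X, y, lam)
        history.append(loss)           # regularized loss before this update
        W = W - lr * gW + noise_std * rng.standard_normal(W.shape)
        a = a - lr * ga + noise_std * rng.standard_normal(a.shape)
    return W, a, history

# Illustrative data: uniform signs, label = product of the first two coordinates.
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 1.0], size=(200, 10))
y = X[:, 0] * X[:, 1]
W, a, history = noisy_gd(X, y)
```

Tracking `history` shows how the regularized objective evolves; comparing runs with `lam=0.0` against a small positive `lam` is one way to probe the margin and test-accuracy effect described above, though this toy setup makes no claim to reproduce the paper's simulations.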


This digest was generated by AI and reviewed by human editors.