QUICK REVIEW

[Paper Review] Finite Versus Infinite Neural Networks: an Empirical Study

Jaehoon Lee, Samuel S. Schoenholz|arXiv (Cornell University)|Jul 31, 2020

Topic: Advanced Neural Network Applications | References: 134 | Cited by: 39

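One-Sentence Summary

This paper presents a large-scale empirical comparison of finite-width neural networks and their infinite-width kernel counterparts (NNGP/NTK) across architectures, showing when and how the correspondence breaks down and how to improve performance in both regimes.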

ABSTRACT

We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

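Research Motivation and Goals

Quantify, across architectures, when wider neural networks converge to their kernel counterparts (NNGP/NTK).
Identify the training practices that preserve or break the correspondence between finite and infinite width.
Develop practical best practices that improve the performance of both finite-width and infinite-width models.
Explore how data preprocessing, ensembling, and architecture affect kernel methods and finite networks.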

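Proposed Method

Run systematic experiments on FCN and CNN architectures (the CNNs with either a VEC flattening readout or a GAP global-average-pooling readout), using ReLU activations under both the standard and the NTK parameterization.
Compute exact NNGP and NTK kernels and compare them against finite-width networks trained by gradient descent.
Apply interventions: centering, large learning rates, weight decay, ensembling, ZCA whitening, and data augmentation.
Use MSE loss so that networks and kernels can be compared directly, with notes on how results differ under softmax cross-entropy.
Evaluate performance on CIFAR-10, with robustness checks on CIFAR-100 and Fashion-MNIST.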
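
To make "computing exact NNGP and NTK kernels" concrete, the following is a minimal NumPy sketch of the well-known closed-form kernel recursion for a ReLU fully-connected network. It is an illustration, not the paper's code: the function name `relu_fc_kernels`, the bias-free layers, and the default weight variance `sigma_w2=2.0` are assumptions made here, and inputs are assumed to be nonzero vectors.

```python
import numpy as np

def relu_fc_kernels(x1, x2, depth=3, sigma_w2=2.0):
    """NNGP and NTK kernels of a bias-free ReLU fully-connected network.

    x1: (n1, d) inputs, x2: (n2, d) inputs (rows must be nonzero).
    depth: number of hidden ReLU layers before the linear readout.
    Returns (nngp, ntk), each of shape (n1, n2).
    """
    d = x1.shape[1]
    # Pre-activation kernel of the first layer, plus the two diagonals
    # needed to normalize it into a cosine.
    k12 = sigma_w2 * x1 @ x2.T / d
    k11 = sigma_w2 * np.sum(x1 * x1, axis=1) / d
    k22 = sigma_w2 * np.sum(x2 * x2, axis=1) / d
    ntk = k12.copy()  # for a single linear layer the NTK equals the NNGP kernel
    for _ in range(depth):
        norm = np.sqrt(np.outer(k11, k22))
        cos = np.clip(k12 / norm, -1.0, 1.0)
        theta = np.arccos(cos)
        # E[relu(u) relu(v)] for jointly Gaussian (u, v): arc-cosine kernel.
        k12 = sigma_w2 * norm * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
        # E[relu'(u) relu'(v)] = (pi - theta) / (2 pi) propagates the NTK.
        ntk = k12 + sigma_w2 * (np.pi - theta) / (2 * np.pi) * ntk
        # On the diagonal theta = 0, so the variance scales by sigma_w2 / 2.
        k11 = sigma_w2 * k11 / 2
        k22 = sigma_w2 * k22 / 2
    return k12, ntk

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
nngp, ntk = relu_fc_kernels(x, x, depth=3)
```

With `sigma_w2=2.0` the NNGP diagonal is preserved from layer to layer, and the NTK diagonal grows by one copy of it per layer, which gives a quick sanity check on the recursion. For convolutional architectures the kernels also track spatial positions, which this sketch does not attempt.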

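Experimental Results

Research Questions

RQ1: How does the accuracy of finite-width networks compare with that of infinite-width kernels (NNGP/NTK) across architectures?
RQ2: Which training techniques preserve or break the correspondence between finite and infinite width?
RQ3: Which practical techniques (centering, ensembling, regularization, preprocessing) improve performance in both regimes?
RQ4: How do data augmentation and preprocessing affect the performance of kernel methods and finite networks?
RQ5: Which limitations (numerical precision, conditioning, equivariance) affect kernel methods at large scale?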

Key Findings

Architecture	Method	NTK accuracy (%)	NNGP accuracy (%)
FC	ZCA Reg (this work)	59.7	59.7
FC	DA Ensemble (this work)	61.5	62.4
CNN-VEC	ZCA Reg (this work)	69.8	69.4
CNN-VEC	DA Ensemble (this work)	70.5	73.2
CNN-GAP	ZCA Reg (this work)	83.2	83.5
CNN-GAP	DA Ensemble (this work)	83.7 (32 ens)	84.8 (32 ens)
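
The "DA Ensemble" rows average several kernel predictors (32 members for CNN-GAP). The sketch below shows one way such an ensemble could be formed and is an assumption-laden illustration rather than the paper's procedure: each member is assumed to be fit on an independently augmented copy of the training set, an RBF kernel stands in for the NNGP/NTK kernel, and `augment`, `kernel_predict`, and `da_ensemble_predict` are hypothetical helper names.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Stand-in kernel (assumption); an NNGP or NTK kernel would go here."""
    sq = np.sum(a * a, 1)[:, None] + np.sum(b * b, 1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq)

def kernel_predict(k_train, k_test_train, y, eps=1e-3):
    """Kernel regression with a trace-scaled diagonal regularizer (assumed form)."""
    n = k_train.shape[0]
    reg = eps * np.trace(k_train) / n
    return k_test_train @ np.linalg.solve(k_train + reg * np.eye(n), y)

def augment(x, rng):
    """Hypothetical augmentation: reverse the feature order of a random
    half of the samples, a stand-in for a horizontal image flip."""
    flip = rng.random(x.shape[0]) < 0.5
    out = x.copy()
    out[flip] = out[flip, ::-1]
    return out

def da_ensemble_predict(x_train, y_train, x_test, n_ens=8, seed=0):
    """Average the predictions of n_ens kernel predictors, each fit on
    its own random augmentation of the training inputs."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_ens):
        x_aug = augment(x_train, rng)
        k_train = rbf_kernel(x_aug, x_aug)
        k_test_train = rbf_kernel(x_test, x_aug)
        preds.append(kernel_predict(k_train, k_test_train, y_train))
    return np.mean(preds, axis=0)

rng = np.random.default_rng(1)
x_tr, y_tr = rng.normal(size=(20, 6)), rng.normal(size=(20, 3))
x_te = rng.normal(size=(4, 6))
pred = da_ensemble_predict(x_tr, y_tr, x_te, n_ens=4)
```

The appeal of this design is cost: each member only needs a kernel over one training-set-sized batch, rather than a single kernel over the full augmented dataset, whose size grows quadratically with the number of augmented samples.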

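When diagonal regularization is carefully tuned, NNGP kernels often outperform NTK kernels on image classification.
In the baseline setting, infinite-width FCN and CNN-VEC kernels can outperform their finite-width counterparts, whereas the infinite-width CNN-GAP kernel can underperform its finite-width counterpart.
Centering and ensembling finite networks reduces prediction variance and brings finite models closer to kernel methods in performance.
Large learning rates and L2 regularization can break the correspondence between kernels and finite-width networks, with the effect depending on architecture and parameterization.
Layer-wise scaling of the L2 penalty, chosen to match the effective penalty under the NTK parameterization, improves the performance of networks in the standard parameterization.
Diagonal regularization of kernels can mimic early stopping; the best validation performance is typically reached with early stopping and with nonzero diagonal regularization.
Floating-point precision limits kernel performance on larger datasets; the critical dataset size depends on how fast the kernel eigenvalues decay (a power-law tail).
Regularized ZCA whitening improves accuracy for both finite networks and kernel methods across architectures.
The benefit of equivariance is confined to narrow networks far from the kernel regime; at large width, equivariance offers almost no advantage.
Ensembling kernel predictors effectively improves NNGP/NTK performance and yields state-of-the-art CIFAR-10 results among kernel methods for several architectures.
Finite-width CNNs with appropriate techniques can surpass kernel methods (for example, CNN-GAP with ensembling), while some FCN ensembles do not fully close the gap.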
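
The link between diagonal regularization and early stopping can be seen in the kernel's eigenbasis. The sketch below is a hedged illustration under stated assumptions: MSE loss, gradient flow from a zero initial function, a regularizer of the trace-scaled form eps * tr(K) / n, and a stopping time matched as t = 1 / (lr * delta). The function name `spectral_filters` is hypothetical. Along an eigendirection with eigenvalue lam, ridge regression fits the fraction lam / (lam + delta) of the training targets, while gradient flow stopped at time t fits 1 - exp(-lr * t * lam).

```python
import numpy as np

def spectral_filters(kernel, eps=1e-2, lr=1.0):
    """Fraction of each eigen-component of the training targets that is fit by
    (a) diagonally regularized kernel regression and
    (b) gradient flow on MSE stopped at a matched time.
    """
    n = kernel.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(kernel), 0.0, None)
    delta = eps * np.trace(kernel) / n        # diagonal regularizer strength
    t_stop = 1.0 / (lr * delta)               # matched stopping time (assumption)
    ridge = eigvals / (eigvals + delta)
    early = 1.0 - np.exp(-lr * t_stop * eigvals)
    return eigvals, ridge, early

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 10))
kernel = feats @ feats.T                      # random PSD kernel for illustration
eigvals, ridge, early = spectral_filters(kernel)
```

Both filters rise monotonically from 0 to 1 with the eigenvalue, so both suppress the small-eigenvalue directions and keep the large ones. They are similar rather than identical: the early-stopping filter is never below the ridge filter at the matched time.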

This review was generated by AI and checked by a human editor.