QUICK REVIEW

[Paper Review] What Can ResNet Learn Efficiently, Going Beyond Kernels?

Zeyuan Allen-Zhu, Yuanzhi Li | arXiv (Cornell University) | May 24, 2019

Domain Adaptation and Few-Shot Learning | References: 39 | Cited by: 61

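One-Sentence Summary

The paper shows that a three-layer ResNet can efficiently learn a distribution-free concept class, which includes functions defined by smaller ResNets with smooth activations, and that on this class neural networks can outperform kernel methods in both generalization and sample efficiency.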

ABSTRACT

How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy more than 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justifications for this gap? Recently, there is an influential line of work relating neural networks to kernels in the over-parameterized regime, proving they can learn certain concept classes that are also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class BETTER than kernels? We answer this positively in the distribution-free setting. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be MUCH SMALLER than ANY kernel method, including neural tangent kernels (NTK). The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using different layers, which reduces the sample complexity compared to "one-shot" learning algorithms such as kernel methods. In a follow-up work [2], this theory of hierarchical learning is further strengthened to incorporate the "backward feature correction" process when training deep networks. In the end, we also prove a computational complexity advantage of ResNet with respect to other learning methods including linear regression over arbitrary feature mappings.

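Research Motivation and Objectives

Investigate whether neural networks can efficiently learn a distribution-free concept class that lies beyond the reach of kernel methods.
Compare the generalization performance of ResNet-based learning against kernel methods (including NTK) on the same task.
Show how a multi-layer residual structure enables hierarchical, or forward, feature learning and thereby lowers sample complexity.
Provide a theoretical separation between neural networks and kernel methods in the distribution-free setting.
Prove a computational complexity advantage of ResNet over linear regression on arbitrary feature mappings.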

Proposed Method

Define the learner as a three-layer residual network with ReLU activations and a bottleneck parameterization: out(x) = A(σ(Wx + b1) + σ(U σ(Wx + b1) + b2)).
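Define the target concept class as H(x) = F(x) + α G(F(x)), where F and G are two-layer networks; the analysis is carried out in the agnostic, distribution-free setting.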
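Prove that SGD trains this network efficiently, reaching population risk ≤ δ with N = Õ(C_F^2 / δ^2) samples, independent of the composite term G(F).
Contrast this with kernel methods: there exist distributions on which any kernel, even with a similar or larger sample budget, reaches population risk no better than α^2, whereas ResNet reaches close to α^3.9 in polynomial time.
Give the hierarchical-learning intuition: the lower layer first learns F-type features, so the higher layer can learn G(F) from fewer samples.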
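
The learner and the target class described above can be written out directly. The following is a minimal pure-Python sketch, not the authors' code: the helper names (relu, matvec, add, resnet_out, target_H), the weights, and the closed-form toy F and G are all assumptions made for illustration, whereas the paper takes F and G to be two-layer networks.

```python
def relu(v):
    """Elementwise ReLU, the activation σ in the learner."""
    return [max(0.0, t) for t in v]

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * t for m, t in zip(row, v)) for row in M]

def add(u, v):
    """Elementwise vector sum."""
    return [a + b for a, b in zip(u, v)]

def resnet_out(x, W, b1, U, b2, A):
    """out(x) = A(σ(Wx + b1) + σ(U σ(Wx + b1) + b2))."""
    h1 = relu(add(matvec(W, x), b1))   # first hidden layer
    h2 = relu(add(matvec(U, h1), b2))  # second hidden layer, fed by h1
    return matvec(A, add(h1, h2))      # skip connection, then linear output

def target_H(x, F, G, alpha):
    """H(x) = F(x) + α G(F(x)): a base signal plus a small composite term."""
    fx = F(x)
    return add(fx, [alpha * g for g in G(fx)])

# Toy weights (hypothetical values chosen so the result is easy to check).
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 1.0], [0.0, -1.0]]
A = [[1.0, 1.0]]
zero = [0.0, 0.0]
print(resnet_out([1.0, 2.0], W, zero, U, zero, A))  # [6.0]

# Toy stand-ins for F and G (hypothetical, not two-layer networks).
F = lambda x: [x[0] + x[1]]
G = lambda y: [y[0] ** 2]
print(target_H([1.0, 2.0], F, G, 0.5))  # [7.5]
```

The shape of resnet_out mirrors the shape of target_H: the first hidden layer plays the role of F, and the residual branch adds a correction computed from it, which is the structural match the hierarchical-learning intuition relies on.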

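Experimental Results

Research Questions

RQ1: Can neural networks provably learn a notable class of functions more efficiently than kernel methods in the distribution-free setting?
RQ2: Do three-layer ResNets (with smooth activations) learn H(x) = F(x) + αG(F(x)) more sample-efficiently than NTK and other kernel methods?
RQ3: Under distribution-free assumptions, is there a provable generalization gap between SGD-trained neural networks and kernel methods on this class?
RQ4: Does the ResNet architecture have a computational complexity advantage over linear regression on arbitrary feature mappings for this problem?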

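Key Findings

A three-layer ResNet efficiently learns the concept class H(x) = F(x) + αG(F(x)) to population risk δ with N = Õ(C_F^2 / δ^2) samples, independent of the composition with G.
There exist simple distributions under which no kernel method achieves population risk better than α^2, whereas ResNet reaches close to α^3.9 in polynomial time.
ResNet exhibits an inductive bias toward forward feature learning: lower layers learn simpler features that help higher layers capture more complex signals.
A computational complexity advantage of ResNet over linear regression on arbitrary feature mappings is established.
The work provides the first provable separation between neural networks with ReLU activations and kernel methods in the distribution-free setting.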
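
To get a feel for the rates in the findings above, the sketch below plugs numbers into them. It is only a back-of-the-envelope illustration under loud assumptions: it drops the constants and logarithmic factors hidden by Õ, treats α^2 and α^3.9 as exact values rather than bounds, and uses a made-up value for C_F. The function names are hypothetical.

```python
def kernel_risk_floor(alpha):
    """Risk that kernel methods cannot beat on the hard distributions: α^2."""
    return alpha ** 2

def resnet_risk(alpha):
    """Risk the three-layer ResNet is stated to approach: α^3.9."""
    return alpha ** 3.9

def sample_count(c_f, delta):
    """N ~ C_F^2 / δ^2, ignoring constants and log factors."""
    return (c_f / delta) ** 2

for alpha in (0.3, 0.1, 0.03):
    floor = kernel_risk_floor(alpha)
    net = resnet_risk(alpha)
    # The ratio equals α^(-1.9), so the gap widens as α shrinks.
    print(f"alpha={alpha}: kernel >= {floor:.2e}, ResNet ~ {net:.2e}, "
          f"ratio ~ {floor / net:.1f}")

# Hypothetical C_F = 2.0; target risk δ = 0.1 gives roughly 400 samples.
print(round(sample_count(2.0, 0.1)))
```

For α = 0.1 the ratio is 10^1.9, roughly 79, so under these simplifications the kernel risk floor sits almost two orders of magnitude above the ResNet rate.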

This review was generated by AI and reviewed by a human editor.