QUICK REVIEW

[Paper Review] A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu, Yuanzhi Li|arXiv (Cornell University)|Nov 9, 2018

Reinforcement Learning in Robotics | References: 55 | Cited by: 627

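One-Sentence Summary

The paper proves that over-parameterized deep neural networks trained with SGD or gradient descent from random initialization reach zero training error (or 100% training accuracy) in polynomial time under mild assumptions. The proof shows that the objective is almost convex and semi-smooth in a large neighborhood of the initialization, which also yields an equivalence with the neural tangent kernel (NTK).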

ABSTRACT

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
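
The setting in the abstract can be illustrated with a small NumPy sketch: a wide fully-connected ReLU network, randomly initialized, trained by full-batch gradient descent on a handful of samples with squared loss. This is an illustration under stated assumptions, not the paper's experiment. The sizes (`m=256`, two trainable layers, five samples), the step-size heuristic, and the helper names `init_network`, `loss_and_grads`, and `train` are all chosen here for the example; only the hidden layers are trained, with the input and output layers kept fixed as a simplification.

```python
import numpy as np

def init_network(input_dim, m, num_layers, rng):
    """Random Gaussian initialization with variance 2/m (He-style scaling)."""
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, input_dim))  # fixed input layer
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))        # trainable hidden layers
          for _ in range(num_layers)]
    B = rng.normal(0.0, 1.0, size=(1, m))                       # fixed scalar output layer
    return A, Ws, B

def loss_and_grads(A, Ws, B, X, y):
    """Return 0.5 * sum_i (f(x_i) - y_i)^2 and its gradient w.r.t. each hidden W."""
    hs = [np.maximum(X @ A.T, 0.0)]               # activations, each of shape (n, m)
    for W in Ws:
        hs.append(np.maximum(hs[-1] @ W.T, 0.0))
    residual = (hs[-1] @ B.T).ravel() - y         # shape (n,)
    loss = 0.5 * float(np.sum(residual ** 2))
    back = residual[:, None] * B                  # gradient w.r.t. the last activation
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        g_pre = back * (hs[l + 1] > 0)            # apply the ReLU activation pattern
        grads[l] = g_pre.T @ hs[l]                # gradient w.r.t. Ws[l]
        back = g_pre @ Ws[l]                      # propagate to the previous activation
    return loss, grads

def train(X, y, m=256, num_layers=2, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    n, input_dim = X.shape
    A, Ws, B = init_network(input_dim, m, num_layers, rng)
    lr = 1.0 / (n * num_layers * m)               # heuristic step size (an assumption)
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(A, Ws, B, X, y)
        losses.append(loss)
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return losses

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(5, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm inputs
y = data_rng.normal(size=5)                       # arbitrary regression targets
losses = train(X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.3e}")
```

The step size scales like 1/(n·L·m) so that one update changes each output by a bounded amount; it is a rough choice for this toy problem rather than the learning rate from the paper's theorems.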

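Research Motivation and Goals

Build a theoretical understanding of why first-order methods succeed at training deep networks in practice, even though the objective is non-convex and non-smooth.
Prove that over-parameterized deep networks can be trained from random initialization to zero training error in polynomial time.
Extend over-parameterization theory from two-layer networks to multi-layer networks with ReLU activations and a range of architectures.
Connect over-parameterized networks with the neural tangent kernel (NTK) at finite, polynomial width.
Provide a framework that covers fully-connected, CNN, and residual architectures under mild assumptions on the data.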

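Proposed Method

Analyzes the training dynamics of an L-layer fully-connected ReLU network under ℓ2 regression loss (with extensions to other losses).
Proves that, in a neighborhood of the random initialization, the objective is almost convex and semi-smooth, so SGD/GD converges in polynomial time.
Proves the equivalence between over-parameterized networks and the NTK at finite width, m = poly(n, L, δ^{-1}), rather than only in the infinite-width limit.
Derives the gradient formulas and back-propagation structure, using diagonal sign matrices D_{i,ℓ} (which record the ReLU units active for sample i at layer ℓ) to handle the non-smoothness of ReLU.
Proves that forward and backward propagation stay controlled across all L layers (no exponential gradient explosion or vanishing).
Provides a stability analysis under small perturbations of the weights and discusses implications for generalization through NTK behavior.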
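
The claim that forward propagation stays controlled at random initialization can be checked numerically. The sketch below pushes one unit-norm input through several random ReLU layers with Gaussian weights of variance 2/m, and records the norm of each hidden vector together with the fraction of active units (the mean of the diagonal of the sign matrix for that layer). The width, depth, input dimension, and the function name `forward_norms` are assumptions made for this illustration.

```python
import numpy as np

def forward_norms(x, m, num_layers, rng):
    """Return the hidden-vector norm per layer and the fraction of active ReLU units."""
    input_dim = x.shape[0]
    A = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, input_dim))
    h = np.maximum(A @ x, 0.0)
    norms = [float(np.linalg.norm(h))]
    active = []
    for _ in range(num_layers):
        W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
        pre = W @ h
        d_diag = (pre > 0).astype(float)   # diagonal of the 0/1 sign matrix
        h = d_diag * pre                   # same as relu(pre)
        norms.append(float(np.linalg.norm(h)))
        active.append(float(d_diag.mean()))
    return norms, active

rng = np.random.default_rng(0)
x = rng.normal(size=16)
x /= np.linalg.norm(x)                     # unit-norm input
norms, active = forward_norms(x, m=2048, num_layers=8, rng=rng)
for layer, value in enumerate(norms):
    print(f"layer {layer}: ||h|| = {value:.3f}")
```

With variance 2/m, the expected squared norm is preserved from layer to layer, so the printed norms should stay close to 1 instead of growing or shrinking exponentially with depth, and roughly half of the units should be active in each layer.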

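Experimental Results

Research Questions

RQ1: Under mild over-parameterization and non-degenerate data, can a deep neural network trained with SGD from random initialization reach zero training error?
RQ2: How large must the hidden width be, as a polynomial in n, L, and the data separation δ, to guarantee polynomial-time convergence?
RQ3: For multi-layer networks, is the training landscape almost convex and semi-smooth in a neighborhood of the random initialization?
RQ4: Is there a finite-width equivalence between over-parameterized networks and the NTK, analogous to the infinite-width results?
RQ5: Do these results extend to CNNs and ResNets with ReLU activations, and to loss functions other than the squared loss?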
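
The data separation δ in RQ2 can be made concrete with a short sketch. It assumes that δ is the smallest pairwise Euclidean distance between training inputs after scaling each input to unit norm; the function name `data_separation` and the three sample points are hypothetical and chosen only for illustration.

```python
import numpy as np

def data_separation(X):
    """Smallest pairwise distance between the rows of X after unit-norm scaling."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    diffs = X[:, None, :] - X[None, :, :]        # shape (n, n, input_dim)
    dists = np.linalg.norm(diffs, axis=2)        # shape (n, n)
    np.fill_diagonal(dists, np.inf)              # ignore the distance of a point to itself
    return float(dists.min())

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
delta = data_separation(X)
print(delta)  # sqrt(2), about 1.4142, for these three points
```

Two inputs that point in the same direction give δ = 0, which is the degenerate case the non-degeneracy assumption excludes; a smaller δ means a larger required width and more iterations in the bounds.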

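Key Findings

For regression, gradient descent finds an ε-error global minimum in poly(n, L, δ^{-1})·log(1/ε) iterations, provided the width satisfies m ≥ poly(n, L, δ^{-1})·d.
With a suitable learning rate and mini-batch size, SGD reaches the same training-error target within poly(n, L, δ^{-1})·log^2 m iterations.
Near the random initialization the objective is almost convex and semi-smooth, which rules out bad saddle points and guarantees descent.
At finite width there is an equivalence between over-parameterized networks and the NTK that needs only polynomial width (not only the infinite-width limit).
The analysis handles the non-smooth ReLU activation and extends to CNNs and ResNets, so the results apply broadly.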
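
Why a gradient lower bound together with a smoothness-type bound gives linear convergence can be seen in a short, simplified derivation. The constants $\alpha$ and $\beta$ are placeholders introduced here, not the paper's quantities, and the second line uses ordinary smoothness; the paper's semi-smoothness property is weaker because ReLU is non-smooth, and this sketch drops the extra terms. It also assumes the iterates stay inside the neighborhood of the initialization where both bounds hold.

```latex
% Schematic only: \alpha and \beta are placeholder constants.
\begin{align*}
\|\nabla F(W)\|_F^2 &\ge \alpha\, F(W)
  && \text{(gradient lower bound near initialization)} \\
F\bigl(W - \eta \nabla F(W)\bigr) &\le F(W) - \eta\,\|\nabla F(W)\|_F^2
  + \tfrac{\beta}{2}\,\eta^2\,\|\nabla F(W)\|_F^2
  && \text{(simplified smoothness)} \\
F(W_{t+1}) &\le F(W_t) - \tfrac{\eta}{2}\,\|\nabla F(W_t)\|_F^2
  \le \bigl(1 - \tfrac{\eta\alpha}{2}\bigr)\, F(W_t)
  && \text{(for } \eta \le 1/\beta\text{)} \\
F(W_T) &\le \bigl(1 - \tfrac{\eta\alpha}{2}\bigr)^{T} F(W_0) \le \varepsilon
  && \text{(once } T \ge \tfrac{2}{\eta\alpha}\log\tfrac{F(W_0)}{\varepsilon}\text{)}
\end{align*}
```

The number of iterations therefore grows only like log(1/ε), which matches the linear convergence speed stated in the abstract.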


This review was generated by AI and checked by human editors.