QUICK REVIEW

[Paper Review] A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

Yiping Lu, Chao Ma|arXiv (Cornell University)|Mar 11, 2020

Stochastic Gradient Optimization Techniques|References: 48|Cited by: 26

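One-Sentence Summary

This paper proposes a new mean-field ODE model for deep residual networks (ResNets) that treats each residual block as a particle drawn from a distribution, and it obtains a global convergence guarantee without assuming convexity. It shows that in the mean-field limit every local minimum attains zero loss, which gives the first global convergence result for multilayer networks in the mean-field regime, with the overparameterization coming from depth.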

ABSTRACT

Training deep neural networks with stochastic gradient descent (SGD) can often achieve zero training loss on real-world tasks although the optimization landscape is known to be highly non-convex. To understand the success of SGD for training deep neural networks, this work presents a mean-field analysis of deep residual networks, based on a line of works that interpret the continuum limit of the deep residual network as an ordinary differential equation when the network capacity tends to infinity. Specifically, we propose a new continuum limit of deep residual networks, which enjoys a good landscape in the sense that every local minimizer is global. This characterization enables us to derive the first global convergence result for multilayer neural networks in the mean-field regime. Furthermore, without assuming the convexity of the loss landscape, our proof relies on a zero-loss assumption at the global minimizer that can be achieved when the model shares a universal approximation property. Key to our result is the observation that a deep residual network resembles a shallow network ensemble, i.e. a two-layer network. We bound the difference between the shallow network and our ResNet model via the adjoint sensitivity method, which enables us to apply existing mean-field analyses of two-layer networks to deep networks. Furthermore, we propose several novel training schemes based on the new continuous model, including one training procedure that switches the order of the residual blocks and results in strong empirical performance on the benchmark datasets.

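Motivation and Objectives

Close the theoretical gap between the empirical success of SGD in training deep ResNets and the lack of provable global convergence guarantees in the non-convex, overparameterized setting.
Construct a continuous, mean-field ODE model of deep ResNets that captures training dynamics as a gradient flow over the distribution of residual-block parameters.
Prove that, despite non-convexity, every local minimum in the mean-field limit is a global minimum with zero loss.
Motivate new training schemes through the resemblance between a deep ResNet and an ensemble of two-layer overparameterized networks.
Provide a theoretical foundation for deep network optimization beyond the "lazy" or kernel regime.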

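Proposed Method

Formally define a new continuum limit of deep ResNets as a mean-field ODE: $\dot{X}_{\rho}(x,t) = \int_{\theta} f(X_{\rho}(x,t), \theta) \rho(\theta,t) d\theta$, where $\rho(\theta,t)$ is the distribution of residual-block parameters along the depth variable $t$.
Use the adjoint sensitivity method to bound the difference between the gradient of the deep ResNet and that of a two-layer overparameterized network, showing that the two gradients are close when the loss levels are comparable.
Transfer convergence guarantees to the deep ResNet model by drawing on existing mean-field analyses of two-layer networks.
Propose a novel training scheme that reorders the residual blocks and improves performance at no extra computational cost.
Prove that, in the mean-field model, a full-support stationary point of the Wasserstein gradient flow is a global optimum even when the loss landscape is non-convex.
Assume zero loss at the global minimizer, an assumption that holds when the model has a universal approximation property.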
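To make the mean-field ODE above more concrete, the following is a minimal sketch of its forward Euler discretization: the depth variable $t \in [0,1]$ is split into $L$ layers, and the integral over $\rho(\theta,t)$ at each layer is approximated by an average over a few sampled parameter "particles". The residual function `residual_fn` (a single tanh unit), the Gaussian sampling in `sample_theta`, and all sizes are illustrative assumptions, not the architecture or initialization used in the paper.

```python
import math
import random

def residual_fn(x, theta):
    """Hypothetical residual block f(x, theta) = a * tanh(w . x + b)."""
    a, w, b = theta
    act = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [ai * act for ai in a]

def sample_theta(dim, rng):
    """Draw one parameter particle (a, w, b); the Gaussian law is an assumption."""
    a = [rng.gauss(0, 1) for _ in range(dim)]
    w = [rng.gauss(0, 1) for _ in range(dim)]
    b = rng.gauss(0, 1)
    return (a, w, b)

def mean_field_resnet(x, particles_per_layer):
    """Forward Euler: X_{l+1} = X_l + (1/L) * mean_m f(X_l, theta_{l,m})."""
    depth = len(particles_per_layer)
    dt = 1.0 / depth
    state = list(x)
    for particles in particles_per_layer:
        # Empirical average over particles stands in for the integral over rho(., t).
        drift = [0.0] * len(state)
        for theta in particles:
            out = residual_fn(state, theta)
            drift = [d + o / len(particles) for d, o in zip(drift, out)]
        state = [s + dt * d for s, d in zip(state, drift)]
    return state

rng = random.Random(0)
dim, depth, n_particles = 3, 50, 8
layers = [[sample_theta(dim, rng) for _ in range(n_particles)]
          for _ in range(depth)]
x0 = [1.0, -0.5, 0.25]
x1 = mean_field_resnet(x0, layers)
print(x1)
```

With one particle per layer this reduces to a standard ResNet whose residual branch is scaled by $1/L$; increasing `depth` refines the time discretization, and increasing `n_particles` refines the approximation of $\rho(\theta,t)$ at each depth.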

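Experimental Results

Research Questions

RQ1: Can a mean-field ODE model of deep ResNets be constructed whose loss landscape ensures global convergence without relying on a convexity assumption?
RQ2: How does the gradient of a deep ResNet compare with that of a two-layer overparameterized network, and can this relationship support a global convergence guarantee?
RQ3: Can overparameterization from depth alone, without relying on the "lazy" or kernel regime, yield a favorable optimization landscape for deep networks?
RQ4: Can new training methods be derived from the mean-field model that improve practical performance on benchmark datasets?
RQ5: What role does the distribution of residual-block parameters play in the global optimality of such deep ResNets?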

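Key Findings

In the proposed mean-field ODE model of deep ResNets, every local minimum attains zero loss, so every local optimum of this continuum limit is a global optimum.
When the loss levels are comparable, the gradient of the deep ResNet is within a constant factor of the gradient of a two-layer overparameterized network, which allows convergence guarantees to be transferred.
The paper establishes the first global convergence result for multilayer neural networks in the mean-field regime without assuming convexity of the loss landscape.
Experiments on CIFAR-10 and CIFAR-100 show that the proposed mean-field training scheme consistently outperforms standard SGD across all ResNet and ResNeXt architectures evaluated, with test accuracy gains of 0.25% to 0.55%.
The new training method that reorders residual blocks achieves stronger empirical performance at no extra computational cost, suggesting that structural reordering can aid optimization.
The analysis shows that a deep ResNet behaves like an ensemble of shallow networks, which helps explain its favorable optimization properties despite strong non-convexity.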
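The "ensemble of shallow networks" view can be illustrated with a small numerical experiment. The sketch below compares a deep ResNet against a shallow ensemble in which every block reads the original input; when the residual branches are small, the two outputs agree to first order. This is only a first-order illustration under assumed toy blocks (`block`, random Gaussian parameters, a hypothetical `scale` factor on the residual branch), not the adjoint-based gradient bound proved in the paper.

```python
import math
import random

def block(x, theta):
    """Hypothetical residual block f(x, theta) = a * tanh(w . x + b)."""
    a, w, b = theta
    act = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [ai * act for ai in a]

def deep_resnet(x, thetas, scale):
    """Blocks are composed: each block reads the running hidden state."""
    step = scale / len(thetas)
    state = list(x)
    for theta in thetas:
        out = block(state, theta)
        state = [s + step * o for s, o in zip(state, out)]
    return state

def shallow_ensemble(x, thetas, scale):
    """Blocks are summed: every block reads the original input x."""
    step = scale / len(thetas)
    total = list(x)
    for theta in thetas:
        out = block(x, theta)
        total = [t + step * o for t, o in zip(total, out)]
    return total

def gap(x, thetas, scale):
    """Euclidean distance between the deep and shallow outputs."""
    deep = deep_resnet(x, thetas, scale)
    shallow = shallow_ensemble(x, thetas, scale)
    return math.sqrt(sum((d - s) ** 2 for d, s in zip(deep, shallow)))

rng = random.Random(1)
dim, depth = 3, 20
thetas = [([rng.gauss(0, 1) for _ in range(dim)],
           [rng.gauss(0, 1) for _ in range(dim)],
           rng.gauss(0, 1)) for _ in range(depth)]
x0 = [0.5, -1.0, 0.2]
gap_large = gap(x0, thetas, 1.0)   # full-size residual branches
gap_small = gap(x0, thetas, 0.1)   # residual branches shrunk tenfold
print(gap_large, gap_small)
```

Shrinking `scale` makes the hidden state drift less from the input, so the composed network and the summed ensemble see nearly the same inputs and their outputs move closer together.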

This review was generated by AI and checked by human editors.