QUICK REVIEW

[Paper Digest] The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent

Karthik Abinav Sankararaman, Soham De|arXiv (Cornell University)|Apr 15, 2019

Stochastic Gradient Optimization Techniques | References: 72 | Cited by: 35

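One-sentence summary

This paper introduces gradient confusion as a measure for analyzing SGD dynamics in overparameterized networks and shows that greater width lowers confusion while greater depth raises it; techniques such as batch normalization combined with skip connections ease the training burden that depth imposes.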

ABSTRACT

This paper studies how neural network architecture affects the speed of training. We introduce a simple concept called gradient confusion to help formally analyze this. When gradient confusion is high, stochastic gradients produced by different data samples may be negatively correlated, slowing down convergence. But when gradient confusion is low, data samples interact harmoniously, and training proceeds quickly. Through theoretical and experimental results, we demonstrate how the neural network architecture affects gradient confusion, and thus the efficiency of training. Our results show that, for popular initialization techniques, increasing the width of neural networks leads to lower gradient confusion, and thus faster model training. On the other hand, increasing the depth of neural networks has the opposite effect. Our results indicate that alternate initialization techniques or networks using both batch normalization and skip connections help reduce the training burden of very deep networks.

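Motivation and objectives

Motivate and formalize gradient confusion as a factor in SGD convergence on overparameterized networks.
Analyze how architectural choices (width, depth) affect gradient confusion under Gaussian initialization.
Derive theoretical bounds that link gradient confusion to the convergence rate of SGD and to training speed.
Validate the theory empirically with WRNs, CNNs, and MLPs on CIFAR and MNIST.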

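Proposed approach

Define gradient confusion as a bound on the pairwise inner products of gradients computed on different mini-batches.
Prove convergence of constant-learning-rate SGD in terms of the gradient confusion bound, assuming the PL (Polyak-Łojasiewicz) inequality and Lipschitz smoothness.
Prove that, under Gaussian initialization, gradient confusion increases with depth and decreases with width.
Extend the results to more general settings, including a small-weights assumption and data sampled uniformly from a sphere.
Show that orthogonal initialization makes the gradient confusion of deep linear networks independent of depth.
Run extensive experiments that measure gradient cosine similarity and training convergence on WRNs, CNNs, and MLPs to check the theory.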
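The definition and the two assumptions named above can be written out as follows. This is a sketch in notation assumed here rather than quoted from the paper: $f_i$ is the loss on sample or mini-batch $i$, $F$ is their average, $\eta$ is the gradient confusion bound, $L$ is the smoothness constant, $\mu$ is the PL constant, and $F^\star$ is the minimum value of $F$. The smoothness and PL conditions are given in their standard textbook form; the exact constants of the paper's convergence theorem are not reproduced.

```latex
% Objective: average of N per-sample (or per-mini-batch) losses.
\[
F(w) = \frac{1}{N} \sum_{i=1}^{N} f_i(w)
\]

% Gradient confusion bound \eta \ge 0 at a point w:
% no pair of gradients may be more negatively aligned than -\eta.
\[
\langle \nabla f_i(w), \nabla f_j(w) \rangle \ge -\eta
\quad \text{for all } i \ne j
\]

% Lipschitz smoothness of each f_i with constant L.
\[
\| \nabla f_i(w) - \nabla f_i(w') \| \le L \, \| w - w' \|
\quad \text{for all } w, w'
\]

% Polyak-Lojasiewicz (PL) inequality for F with constant \mu > 0.
\[
\tfrac{1}{2} \, \| \nabla F(w) \|^2 \ge \mu \, \bigl( F(w) - F^\star \bigr)
\quad \text{for all } w
\]
```

With $\eta = 0$ no two gradients point in opposing directions, so a step that lowers $f_i$ never raises $f_j$ to first order; larger $\eta$ allows stronger conflicts between samples.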

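Experimental results

Research questions

RQ1: How does gradient confusion quantify the interaction between mini-batch gradients when running SGD on overparameterized networks?
RQ2: Under standard Gaussian initialization, how do width and depth affect gradient confusion?
RQ3: Can structural changes such as batch normalization and skip connections reduce gradient confusion and improve trainability?
RQ4: Do the results also hold for linear networks with orthogonal initialization and for points reached during training, beyond initialization?
RQ5: What empirical patterns appear in gradient similarity for common architectures on benchmark datasets?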

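Key findings

Gradient confusion links architecture to SGD speed: higher confusion slows convergence, and lower confusion speeds it up.
Under Gaussian initialization, increasing network depth raises gradient confusion, while increasing width lowers it.
Used together, batch normalization and skip connections substantially reduce gradient confusion in very deep networks and improve trainability.
For deep linear networks with orthogonal initialization, gradient confusion is independent of depth.
Experiments on WRNs, CNNs, and MLPs show that wider networks train faster, with pairwise gradient similarity moving closer to zero as width increases.
The theory explains why architectures with residual connections and normalization can be trained efficiently with a constant learning rate.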
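The width and depth findings can be probed on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a bias-free ReLU MLP with a scalar output, squared loss, random labels, inputs normalized to the unit sphere, and Gaussian weights with variance 1/fan_in. The function names (`init_mlp`, `per_sample_grad`, `confusion_stats`, `sweep`) are hypothetical helpers introduced for illustration. For each (width, depth) pair it reports the smallest bound eta that the per-sample gradients satisfy at initialization, together with the minimum pairwise cosine similarity.

```python
import numpy as np


def init_mlp(rng, in_dim, width, depth):
    """Gaussian init with variance 1/fan_in (an assumed scaling).

    depth counts hidden layers; the output layer maps to a single scalar.
    """
    dims = [in_dim] + [width] * depth + [1]
    return [rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))
            for d_in, d_out in zip(dims[:-1], dims[1:])]


def per_sample_grad(weights, x, y):
    """Flattened gradient of 0.5 * (f(x) - y)^2 w.r.t. all weight matrices."""
    acts, pre, h = [x], [], x
    for W in weights[:-1]:              # forward pass through hidden layers
        z = W @ h
        pre.append(z)
        h = np.maximum(z, 0.0)          # ReLU
        acts.append(h)
    delta = weights[-1] @ h - y         # residual, shape (1,)
    grads = [None] * len(weights)
    grads[-1] = np.outer(delta, acts[-1])
    back = weights[-1].T @ delta        # gradient w.r.t. last hidden activation
    for k in range(len(weights) - 2, -1, -1):
        back = back * (pre[k] > 0)      # backprop through ReLU
        grads[k] = np.outer(back, acts[k])
        back = weights[k].T @ back
    return np.concatenate([g.ravel() for g in grads])


def confusion_stats(grads):
    """Return (eta, min_cos) for an (N, P) array of per-sample gradients.

    eta is the smallest non-negative value with <g_i, g_j> >= -eta for i != j.
    """
    grads = np.asarray(grads, dtype=float)
    off_diag = ~np.eye(len(grads), dtype=bool)
    gram = grads @ grads.T
    eta = max(0.0, float(-gram[off_diag].min()))
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.maximum(norms, 1e-12)   # guard against zero gradients
    min_cos = float((unit @ unit.T)[off_diag].min())
    return eta, min_cos


def sweep(widths, depths, n_samples=32, in_dim=16, seed=0):
    """Map each (width, depth) to (eta, min_cos) at a fresh initialization."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, in_dim))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-sphere inputs
    y = rng.normal(size=n_samples)                  # random labels (assumption)
    results = {}
    for depth in depths:
        for width in widths:
            weights = init_mlp(rng, in_dim, width, depth)
            grads = np.stack([per_sample_grad(weights, x, t)
                              for x, t in zip(X, y)])
            results[(width, depth)] = confusion_stats(grads)
    return results
```

Calling `sweep(widths=[16, 64, 256], depths=[2, 8])` returns a dictionary keyed by `(width, depth)`. A single run depends on the seed, so averaging over several seeds is advisable before comparing the trend against the findings above.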


This digest was generated by AI and reviewed by human editors.