QUICK REVIEW

[Paper Review] A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

Umut Şimşekli, Levent Sagun|arXiv (Cornell University)|Jan 17, 2019

Topic: Gaussian Processes and Bayesian Inference | References: 57 | Cited by: 69

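One-Sentence Summary

The paper shows that stochastic gradient noise in deep networks is heavy-tailed (α-stable) and analyzes SGD as a Lévy-driven stochastic differential equation; its experiments confirm the non-Gaussian tails and reveal two distinct phases of SGD.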

ABSTRACT

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed $\alpha$-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE transition from narrow minima to wider minima, as proven by existing metastability theory. To validate the $\alpha$-stable assumption, we conduct extensive experiments on common deep learning architectures and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We further investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.

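Research Motivation and Objectives

Question the Gaussian noise assumption in SGD and the SDE analysis that rests on the classical central limit theorem.
Propose and validate an α-stable (heavy-tailed) model of stochastic gradient noise.
Connect the tail behavior to SGD dynamics and, through metastability theory, to the tendency to find wide minima.
Empirically characterize how the tail index α varies with network architecture, dataset, and minibatch size.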
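
For reference, the tail index α in these objectives is the stability parameter of a symmetric α-stable (SαS) law. A standard way to define it is through the characteristic function; the symbols $\sigma$ (scale) and $\omega$ (frequency variable) are notation chosen here for illustration, not taken from the review:

$$
X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)
\iff
\mathbb{E}\left[e^{i\omega X}\right] = \exp\left(-|\sigma\omega|^{\alpha}\right),
\qquad \alpha \in (0, 2].
$$

Setting $\alpha = 2$ recovers a Gaussian with variance $2\sigma^2$, and $\alpha = 1$ gives a Cauchy law. For $\alpha < 2$ the tails decay as a power law, $P(|X| > t) \sim C\,t^{-\alpha}$ as $t \to \infty$, so the variance is infinite and the classical CLT no longer applies.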

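Proposed Method

Model the stochastic gradient noise as symmetric α-stable (SαS) noise with tail index α.
Derive a Lévy-driven SDE as the continuous-time limit of SGD when α < 2.
Estimate α from gradient noise samples with a tail-index estimator designed for α-stable distributions.
Run extensive experiments with FCN and CNN architectures on MNIST, CIFAR-10, and CIFAR-100, varying depth, width, and minibatch size.
Analyze metastability and first-jump behavior under Lévy noise, highlighting the jumps and the two phases of SGD.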
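
The estimation step above can be illustrated with a minimal sketch of a block-sum tail-index estimator for SαS samples. The review does not name the estimator, so treating this form as the one used in the paper is an assumption. It relies on the fact that a sum of `k1` i.i.d. SαS variables is distributed as `k1**(1/alpha)` times a single one, so the gap between the mean log-magnitudes of block sums and of raw samples equals `log(k1) / alpha`. The function names `sample_sas` and `estimate_alpha` are hypothetical, and the sampler (Chambers-Mallows-Stuck) is there only to produce test data.

```python
import math
import random


def sample_sas(alpha, n, rng):
    """Draw n standard symmetric alpha-stable samples, 0 < alpha <= 2.

    Uses the Chambers-Mallows-Stuck construction for the symmetric case.
    alpha = 2 yields a Gaussian, alpha = 1 a Cauchy distribution.
    """
    samples = []
    for _ in range(n):
        u = rng.uniform(-math.pi / 2, math.pi / 2)
        w = rng.expovariate(1.0)
        x = (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
             * (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))
        samples.append(x)
    return samples


def estimate_alpha(xs, k1):
    """Block-sum tail-index estimate from (assumed) i.i.d. SaS samples.

    xs is split into k2 = len(xs) // k1 blocks of k1 consecutive values.
    All samples and block sums are assumed to be nonzero.
    """
    k2 = len(xs) // k1
    if k1 < 2 or k2 < 1:
        raise ValueError("need k1 >= 2 and at least k1 samples")
    xs = xs[: k1 * k2]
    block_sums = [sum(xs[i * k1:(i + 1) * k1]) for i in range(k2)]
    mean_log_blocks = sum(math.log(abs(y)) for y in block_sums) / k2
    mean_log_samples = sum(math.log(abs(x)) for x in xs) / (k1 * k2)
    inv_alpha = (mean_log_blocks - mean_log_samples) / math.log(k1)
    return 1.0 / inv_alpha


if __name__ == "__main__":
    rng = random.Random(0)
    for true_alpha in (1.2, 1.5, 2.0):
        data = sample_sas(true_alpha, 100_000, rng)
        print(true_alpha, round(estimate_alpha(data, k1=100), 3))
```

In practice `xs` would hold the centered coordinates of minibatch gradient noise rather than synthetic draws. The choice of `k1` trades off the number of blocks against the block size; the value 100 here is arbitrary.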

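Experimental Results

Research Questions

RQ1: Is the stochastic gradient noise in deep networks α-stable (heavy-tailed) rather than Gaussian?
RQ2: How does the tail index α vary with network size, architecture, dataset, and minibatch size?
RQ3: What does α-stable noise imply for SGD dynamics, metastability, and the preference for wide minima?
RQ4: Do the early-iteration dynamics show jumps in α that coincide with accuracy gains?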

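Key Findings

The stochastic gradient noise deviates markedly from a Gaussian in every configuration and is heavy-tailed.
Increasing the minibatch size has little effect on the tail index α.
The tail index α depends on the architecture, dataset, and network size, and therefore affects SGD dynamics.
Two-phase SGD behavior is observed: α drops rapidly early in training, then a jump occurs, after which α stabilizes as accuracy improves.
The two-phase behavior is consistent with metastability theory: the jump occurs when α reaches its lowest value.
On the CIFAR datasets, α lies between 1.0 and 1.2 in many configurations, indicating very heavy tails.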
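
The jumps mentioned in these findings come from viewing SGD as an SDE driven by α-stable Lévy motion. A sketch of that correspondence follows, in notation assumed here rather than taken from the review: $\theta$ for the weights, $f$ for the loss, $\eta$ for the step size, $\sigma$ for the noise scale, $S_k$ for a standard SαS draw, and $L_t^{\alpha}$ for α-stable Lévy motion.

$$
\theta_{k+1}
= \theta_k - \eta\,\nabla f(\theta_k) + \eta\,\sigma\,S_k
= \theta_k - \eta\,\nabla f(\theta_k) + \eta^{1/\alpha}\left(\eta^{(\alpha-1)/\alpha}\sigma\right) S_k
$$

$$
\mathrm{d}\theta_t = -\nabla f(\theta_t)\,\mathrm{d}t + \eta^{(\alpha-1)/\alpha}\,\sigma\,\mathrm{d}L_t^{\alpha}
$$

Increments of $L_t^{\alpha}$ over a time step of length $\eta$ scale as $\eta^{1/\alpha}$, so the first line reads as an Euler discretization of the second. For $\alpha = 2$ the driving process is a scaled Brownian motion with continuous paths; for $\alpha < 2$ the paths have discontinuities, and the smaller $\alpha$ is, the larger the occasional jumps become. Values near 1.0 to 1.2 therefore correspond to a regime where large jumps are comparatively frequent.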

This review was generated by AI and checked by a human editor.