QUICK REVIEW

[Paper Digest] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere | arXiv (Cornell University) | Sep 15, 2016

Stochastic Gradient Optimization Techniques | References: 34 | Cited by: 577

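ONE-SENTENCE SUMMARY

The paper shows that large-batch SGD tends to converge to sharp minima, which leads to a generalization gap, whereas small-batch methods find flatter minima; the gradient noise inherent in small-batch updates aids exploration, and warm-starting large-batch training from small-batch iterates may narrow the gap.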

ABSTRACT

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
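
To make the abstract's "inherent noise in the gradient estimation" concrete, here is a minimal sketch on a toy linear-regression problem. The model, the synthetic data, and the function names `minibatch_gradient` and `gradient_noise` are hypothetical illustrations, not taken from the paper. It only shows that the mini-batch gradient deviates more from the full-batch gradient as the batch shrinks.

```python
import numpy as np

def minibatch_gradient(w, X, y, batch_size, rng):
    """Gradient of the mean squared error on one random mini-batch (toy linear model)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ residual / batch_size

def gradient_noise(w, X, y, batch_size, rng, trials=200):
    """Average squared distance between mini-batch gradients and the full-batch gradient."""
    full = 2.0 * X.T @ (X @ w - y) / len(X)
    deviations = [
        np.sum((minibatch_gradient(w, X, y, batch_size, rng) - full) ** 2)
        for _ in range(trials)
    ]
    return float(np.mean(deviations))

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=2048)
w = np.zeros(10)

noise_small = gradient_noise(w, X, y, batch_size=32, rng=rng)
noise_large = gradient_noise(w, X, y, batch_size=1024, rng=rng)
print(f"batch 32: {noise_small:.4f}   batch 1024: {noise_large:.4f}")
```

On this toy problem the batch-32 estimate should be markedly noisier than the batch-1024 estimate. Whether that noise actually steers training toward flat minima is the paper's empirical claim, and this sketch does not test it.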

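MOTIVATION AND OBJECTIVES

Motivate and quantify the generalization gap observed when training deep networks with large mini-batch SGD.
Investigate whether large-batch methods converge to sharp minima and how this relates to poorer generalization.
Compare the minima found by small-batch and large-batch training across several network architectures.
Offer candidate remedies and practical insights for improving large-batch training without sacrificing generalization.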

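PROPOSED METHOD

Define SB (small-batch) and LB (large-batch) training regimes and, using ADAM, compare their behavior across six network/dataset configurations.
Characterize minima with a sharpness (sensitivity) metric based on perturbations within a local neighborhood.
Plot parametric curves of the loss along the line between the SB and LB solutions to illustrate the sharpness of the minima.
Run warm-start experiments to test how SB exploration affects the LB outcome.
Analyze the batch-size threshold and its effect on generalization and sharpness.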
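
As a rough illustration of a neighborhood-based sharpness measure, the sketch below estimates $100 \cdot (\max_{z \in C_\epsilon} f(x + z) - f(x)) / (1 + f(x))$ over the box $|z_i| \le \epsilon(|x_i| + 1)$. This is assumed here to be the full-space form of the paper's metric. The inner maximization is replaced by plain random search, which is a simplification chosen for brevity and yields only a lower bound on the true maximum. The function name `sharpness` and the toy quadratic losses are hypothetical.

```python
import numpy as np

def sharpness(f, x, eps=1e-3, samples=500, seed=0):
    """Random-search estimate of 100 * (max f(x + z) - f(x)) / (1 + f(x)).

    z is drawn uniformly from the box |z_i| <= eps * (|x_i| + 1).
    Random search is an assumption made for brevity; it can only
    underestimate the true maximum over the box.
    """
    rng = np.random.default_rng(seed)
    radius = eps * (np.abs(x) + 1.0)
    base = f(x)
    worst = base
    for _ in range(samples):
        z = rng.uniform(-radius, radius)
        worst = max(worst, f(x + z))
    return 100.0 * (worst - base) / (1.0 + base)

# Two toy quadratic "losses" sharing a minimizer but differing in curvature.
def flat_loss(x):
    return 0.5 * 1.0 * float(x @ x)

def sharp_loss(x):
    return 0.5 * 100.0 * float(x @ x)

x_star = np.zeros(5)
flat_value = sharpness(flat_loss, x_star, eps=1e-1)
sharp_value = sharpness(sharp_loss, x_star, eps=1e-1)
print(f"flat: {flat_value:.4f}   sharp: {sharp_value:.4f}")
```

Because both calls use the same seed, they probe identical perturbations, so the two values differ only through the curvature of the loss. For a real network, `f` would be the training loss as a function of the flattened parameter vector, and a bound-constrained optimizer would give a tighter estimate than random search.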

EXPERIMENTAL RESULTS

Research Questions

RQ1: Does large-batch training lead to sharp minimizers that degrade generalization?
RQ2: How do SB and LB minimizers differ in terms of sharpness and local landscape structure?
RQ3: Can gradient noise from SB training help LB methods escape sharp basins and improve generalization?
RQ4: What practical strategies might mitigate the generalization drop associated with LB training?

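Key Findings

LB methods converge to sharp minima, characterized by a significant number of large positive Hessian eigenvalues, and generalize worse.
SB methods converge to flatter minima, characterized by many small eigenvalues, and generalize better.
Parametric plots and the subspace sharpness analysis show that, across multiple networks, LB minima are markedly sharper than SB minima.
Warm-start experiments show that SB exploration can lead LB to flat minima, provided LB training starts only after sufficient SB exploration.
There is a threshold batch size beyond which the test accuracy of LB training degrades on several networks.
Remedies such as data augmentation and adversarial training improve LB generalization to some extent but do not fully eliminate sharp minima.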
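
The parametric comparison of SB and LB minima can be sketched as evaluating the loss at $\alpha x_{LB} + (1 - \alpha) x_{SB}$ for a range of $\alpha$, assumed here to be $[-1, 2]$, so that $\alpha = 0$ is the SB solution and $\alpha = 1$ is the LB solution. The one-dimensional `toy_loss` below, with a wide basin at 0 and a narrow basin at 4, is a hypothetical stand-in for a network's loss surface. It does not reproduce any result from the paper.

```python
import numpy as np

def loss_along_line(f, x_sb, x_lb, alphas):
    """Evaluate f at alpha * x_lb + (1 - alpha) * x_sb for each alpha."""
    return np.array([f(alpha * x_lb + (1.0 - alpha) * x_sb) for alpha in alphas])

def toy_loss(x):
    """Hypothetical loss: a wide basin centered at 0 and a narrow basin centered at 4."""
    wide = 0.05 * (x[0] - 0.0) ** 2
    narrow = 5.0 * (x[0] - 4.0) ** 2
    return float(min(wide, narrow))

x_sb = np.array([0.0])  # stand-in for a flat (small-batch) minimizer
x_lb = np.array([4.0])  # stand-in for a sharp (large-batch) minimizer

alphas = np.linspace(-1.0, 2.0, 61)
curve = loss_along_line(toy_loss, x_sb, x_lb, alphas)

# Compare how quickly the loss rises a short step away from each minimizer.
probe = loss_along_line(toy_loss, x_sb, x_lb, np.array([-0.1, 0.0, 0.1, 0.9, 1.0, 1.1]))
print("rise near SB:", max(probe[0], probe[2]), "  rise near LB:", min(probe[3], probe[5]))
```

Plotting `curve` against `alphas` would show a shallow bowl around $\alpha = 0$ and a steep one around $\alpha = 1$. A one-dimensional slice of this kind is only suggestive, since it says nothing about directions off the line.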

This digest was generated by AI and reviewed by a human editor.