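[Paper Summary] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
The paper shows that large-batch SGD tends to converge to sharp minima, which produces a generalization gap, whereas small-batch methods find flatter minima. The noise inherent in small-batch gradient estimates appears to drive this behavior, and the authors discuss strategies that might help large-batch methods narrow the gap.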
The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
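Motivation and Objectives
- Motivate and quantify the generalization gap observed when SGD is run with large mini-batches in deep learning.
- Investigate whether large-batch methods converge to sharp minima and how this relates to poorer generalization.
- Compare the minima found by small-batch and large-batch training across several network architectures.
- Offer candidate remedies and practical insights for improving large-batch training without sacrificing generalization.
Proposed Method
- Define SB (small-batch) and LB (large-batch) training regimes and compare their behavior with ADAM on six network/dataset configurations.
- Characterize minima with a sharpness (sensitivity) metric based on perturbations within a local neighborhood.
- Plot the loss along the straight line between the SB and LB solutions (parametric plots) to illustrate the sharpness of each minimum.
- Run warm-start experiments to test how SB exploration affects the outcome of LB training.
- Analyze the batch-size threshold and its effect on generalization and sharpness.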
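The perturbation-based sharpness metric in the list above can be sketched in a few lines. In the full-space case, the paper measures the largest relative increase of the loss inside a small box around the solution, $\max_{|y_i| \le \epsilon(|x_i|+1)} \big(f(x+y) - f(x)\big) / \big(1 + f(x)\big) \times 100$. The sketch below is an illustration under stated assumptions, not the authors' implementation: it replaces the inner maximization with plain random sampling (which only gives a lower bound on the true maximum), and the function names `sharpness` and `quad` as well as the toy quadratic losses are hypothetical.

```python
import numpy as np

def sharpness(f, x, eps=1e-3, n_samples=2000, seed=0):
    """Approximate sharpness of a non-negative loss f at the point x.

    The search region is the box |y_i| <= eps * (|x_i| + 1).
    Random sampling stands in for a proper inner maximization,
    so the returned value is a lower bound on the true metric.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    radius = eps * (np.abs(x) + 1.0)          # per-coordinate box half-width
    fx = f(x)
    # Draw perturbations uniformly inside the box
    ys = rng.uniform(-1.0, 1.0, size=(n_samples, x.size)) * radius
    worst = max(f(x + y) for y in ys)         # largest sampled loss value
    return 100.0 * (worst - fx) / (1.0 + fx)

def quad(curvature):
    """Toy loss (hypothetical): an isotropic quadratic bowl centered at 0."""
    return lambda w: 0.5 * curvature * float(np.sum(w ** 2))

# Same minimizer, different curvature: the steeper bowl scores as sharper.
flat = sharpness(quad(1.0), np.zeros(5), eps=0.1)
sharp = sharpness(quad(100.0), np.zeros(5), eps=0.1)
print(f"flat: {flat:.4f}  sharp: {sharp:.4f}")
```

For a real network, `f` would evaluate the training loss at a flattened parameter vector, and a bounded optimizer would give a tighter estimate of the maximum than random sampling.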
Experimental Results
Research Questions
- RQ1: Does large-batch training lead to sharp minimizers that degrade generalization?
- RQ2: How do SB and LB minimizers differ in terms of sharpness and local landscape structure?
- RQ3: Can gradient noise from SB training help LB methods escape sharp basins and improve generalization?
- RQ4: What practical strategies might mitigate the generalization drop associated with LB training?
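Key Findings
- LB methods converge to sharp minima, characterized by a significant number of large positive Hessian eigenvalues, and generalize worse.
- SB methods converge to flatter minima, characterized by many small Hessian eigenvalues, and generalize better.
- Parametric plots and subspace sharpness analysis show that, across multiple networks, LB minima are markedly sharper than SB minima.
- Warm-start experiments show that LB training can reach flat minima if it is started only after enough SB exploration.
- There is a threshold batch size beyond which test accuracy deteriorates on several of the networks.
- Remedies such as data augmentation and adversarial training improve LB generalization to some extent but do not fully eliminate convergence to sharp minima.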
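The parametric plots mentioned in the findings evaluate the loss along the line through the two solutions, $f(\alpha x_{LB} + (1-\alpha) x_{SB})$ for $\alpha \in [-1, 2]$, so that $\alpha = 0$ is the SB solution and $\alpha = 1$ is the LB solution. The following is a minimal sketch of that procedure on a made-up one-dimensional loss. The helper `line_profile` and the function `toy_loss` (a wide basin at 0 and a narrow basin at 4) are hypothetical and only stand in for a trained network's loss.

```python
import numpy as np

def line_profile(f, x_sb, x_lb, alphas):
    """Evaluate f along the line alpha * x_lb + (1 - alpha) * x_sb."""
    x_sb = np.asarray(x_sb, dtype=float)
    x_lb = np.asarray(x_lb, dtype=float)
    return np.array([f(a * x_lb + (1.0 - a) * x_sb) for a in alphas])

def toy_loss(w):
    """Hypothetical 1-D loss: wide basin at w = 0, narrow basin at w = 4."""
    w = float(np.asarray(w).ravel()[0])
    return min(0.5 * w ** 2, 50.0 * (w - 4.0) ** 2)

# alpha = 0 is the "SB" point (w = 0), alpha = 1 is the "LB" point (w = 4)
alphas = np.linspace(-1.0, 2.0, 25)
profile = line_profile(toy_loss, [0.0], [4.0], alphas)

for a, v in zip(alphas, profile):
    print(f"alpha = {a:+.3f}  loss = {v:.3f}")
```

In this toy setting the loss rises much faster on either side of $\alpha = 1$ than around $\alpha = 0$, which is the qualitative picture the paper reports for LB versus SB minimizers. A one-dimensional slice like this is only suggestive, since it says nothing about curvature in other directions.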
This summary was generated by AI and reviewed by human editors.