QUICK REVIEW

[Paper Digest] Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer, Itay Hubara|arXiv (Cornell University)|May 24, 2017

Domain Adaptation and Few-Shot Learning | References: 38 | Cited by: 418

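One-sentence summary

The paper argues that the generalization gap of large-batch SGD stems from too few weight updates rather than from the batch size itself, and shows how learning rate scaling, Ghost Batch Normalization, and regime adaptation narrow that gap.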

ABSTRACT

Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomena. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.
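
The central observation of the abstract can be written compactly. The notation below is assumed here for illustration: w_t is the weight vector after t updates and w_0 is its initialization. An ordinary random walk would instead spread as the square root of t, which is why the logarithmic law is described as "ultra-slow" diffusion.

```latex
\[
  \lVert w_t - w_0 \rVert \sim \log t
  \qquad \text{(observed, initial high learning rate phase)}
\]
\[
  \lVert w_t - w_0 \rVert \sim \sqrt{t}
  \qquad \text{(ordinary random walk, for comparison)}
\]
```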

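Motivation and goals

Motivate and characterize the generalization gap observed in large-batch training of neural networks.
Propose a stochastic optimization model (a random walk on a random potential) to explain weight dynamics early in training.
Propose practical methods to narrow the gap: learning rate scaling, Ghost Batch Normalization (GBN), and regime adaptation.
Validate the claims empirically on MNIST, CIFAR-10/100, and ImageNet across several network architectures.
Reassess common training practices and emphasize that generalization depends on the number of updates, not just on the batch size.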

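Proposed methods

Model SGD as a random walk on a random potential to explain the ultra-slow diffusion of the weights.
Show that the weight distance from initialization grows logarithmically with the number of updates (approximately log t), and relate the diffusion rate to the batch size.
Scale the learning rate with the square root of the batch size (η ∝ sqrt(M)) to preserve the update statistics.
Introduce Ghost Batch Normalization, which computes BN statistics over small "ghost" batches inside each large batch.
Apply regime adaptation: extend the number of training iterations so that the number of updates is comparable across batch sizes.
Validate empirically on standard datasets and networks, reporting accuracy improvements in the small-batch (SB) and large-batch (LB) settings.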
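
The learning rate scaling and regime adaptation items above reduce to simple arithmetic. The sketch below assumes a small-batch baseline (batch size, learning rate, epoch count) and derives the matching large-batch values. The function names and all numbers are hypothetical, chosen only so the division comes out exact; they are not settings reported in the paper.

```python
import math


def scale_learning_rate(base_lr, small_batch, large_batch):
    """Square-root scaling: eta_L = eta_S * sqrt(M_L / M_S)."""
    return base_lr * math.sqrt(large_batch / small_batch)


def adapt_epochs(base_epochs, small_batch, large_batch):
    """Epochs the large-batch run needs to match the small-batch update count."""
    return math.ceil(base_epochs * large_batch / small_batch)


def num_updates(epochs, dataset_size, batch_size):
    """Total weight updates: one update per mini-batch, per epoch."""
    return epochs * math.ceil(dataset_size / batch_size)


# Hypothetical baseline: 65536 samples, batch 128, lr 0.1, 100 epochs.
dataset_size, small_batch, large_batch = 65536, 128, 4096
large_lr = scale_learning_rate(0.1, small_batch, large_batch)
large_epochs = adapt_epochs(100, small_batch, large_batch)

print("large-batch learning rate:", large_lr)
print("large-batch epochs:", large_epochs)
print("small-batch updates:", num_updates(100, dataset_size, small_batch))
print("large-batch updates:", num_updates(large_epochs, dataset_size, large_batch))
```

With these illustrative numbers the batch size grows 32-fold, so matching the update count takes 32 times as many epochs. That is the cost regime adaptation pays for closing the gap.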

Experimental results

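Research questions

RQ1: Can the generalization gap observed in large-batch training be eliminated without increasing total training time?
RQ2: What mechanisms explain how weight updates early in training affect final generalization, and how do batch size and the number of updates interact?
RQ3: Do adjustments such as learning rate scaling and Ghost Batch Normalization consistently reduce or eliminate the generalization gap across architectures and datasets?
RQ4: Can large-batch training reach the same generalization performance as small-batch training by extending the training regime?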

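Key findings

With learning rate scaling and Ghost Batch Normalization, the generalization gap of large-batch training can be largely eliminated.
The weight distance from initialization grows logarithmically with the number of updates, consistently across batch sizes, which suggests that diffusion dynamics rather than batch size itself dominate generalization.
Scaling the learning rate by the square root of the batch size helps preserve the update statistics and improves generalization.
Ghost Batch Normalization significantly reduces generalization error by computing batch statistics over small ghost batches during large-batch training.
Adjusting the number of weight updates to match the small-batch iteration count (regime adaptation) eliminates the gap and yields comparable or better validation accuracy.
Experiments on MNIST, CIFAR-10/100, and ImageNet show consistent gains from +LR (learning rate scaling), +GBN, and +RA (regime adaptation), often matching or exceeding the small-batch (SB) results.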
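
The Ghost Batch Normalization finding above can be illustrated with a minimal training-mode forward pass. This sketch covers only the normalization step: it omits the learnable scale and shift, the running statistics used at inference, and the backward pass. It uses the standard library for readability, whereas a practical version would use a tensor library. The function name and the toy data are hypothetical.

```python
import math
from statistics import fmean, pvariance


def ghost_batch_norm(batch, ghost_size, eps=1e-5):
    """Normalize each ghost batch with its own per-feature mean and variance.

    batch: list of rows, each row a list of float features.
    ghost_size: number of rows per ghost batch; must divide len(batch).
    """
    if len(batch) % ghost_size != 0:
        raise ValueError("batch size must be a multiple of ghost_size")
    normalized = []
    for start in range(0, len(batch), ghost_size):
        ghost = batch[start:start + ghost_size]
        columns = list(zip(*ghost))  # one tuple of values per feature
        means = [fmean(col) for col in columns]
        stds = [math.sqrt(pvariance(col, mu) + eps)
                for col, mu in zip(columns, means)]
        for row in ghost:
            normalized.append([(v - m) / s
                               for v, m, s in zip(row, means, stds)])
    return normalized


# Toy large batch of 8 rows, split into two ghost batches of 4 rows.
toy_batch = ([[float(i)] for i in range(4)]
             + [[100.0 + 10.0 * i] for i in range(4)])
toy_out = ghost_batch_norm(toy_batch, ghost_size=4)
```

Each group of four rows is centered and scaled by its own statistics, so the two halves of the toy batch end up nearly identical after normalization even though their raw values differ by two orders of magnitude. Normalizing the full batch of eight at once would not have that property.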

This digest was generated by AI and reviewed by a human editor.