QUICK REVIEW

[Paper Review] An Empirical Model of Large-Batch Training

Sam McCandlish, Jared Kaplan|arXiv (Cornell University)|Dec 14, 2018

Optimization and Search Problems | References: 36 | Cited by: 139

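One-Sentence Summary

This paper proposes the gradient noise scale as a simple statistic for predicting the largest useful batch size across supervised learning, reinforcement learning, and generative modeling tasks, and analyzes the tradeoff between compute efficiency and time efficiency. Measured across multiple domains, the noise scale rises over the course of training and as task difficulty increases.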

ABSTRACT

In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically-motivated theory also describes the tradeoff between compute-efficiency and time-efficiency, and provides a rough model of the benefits of adaptive batch-size training.

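Motivation and Objectives

Explain why the upper limit on useful batch size differs across domains and datasets.
Introduce the gradient noise scale as a practical predictor of the optimal batch size.
Develop a simple theory linking batch size, gradient noise, and training efficiency.
Empirically validate the predictions on a diverse set of tasks, including ImageNet, CIFAR-10, SVHN, MNIST, Billion Word, Atari, and Dota.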

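Proposed Method

Define and derive the gradient noise scale B_noise = tr(H Σ) / (G^T H G), where G is the true gradient, H is the Hessian of the loss, and Σ is the covariance matrix of the per-example gradients.
Relate the optimal step size to the batch size via ε_opt(B) = ε_max / (1 + B_noise/B).
Define a simplified noise scale for practical measurement, B_simple = tr(Σ) / |G|^2.
Predict a Pareto-style tradeoff between training time and compute cost, in the form of a hyperbola organized around a critical batch size B_crit ~ B_noise.
Measure B_simple, B_noise, and B_crit across tasks and track how they evolve over training.
Fit empirical Pareto frontiers to assess how well they agree with the model's predictions.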
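
The two formulas that are easiest to try out, B_simple and ε_opt(B), can be sketched in a few lines of Python. This is a minimal illustration, not the authors' measurement procedure: it assumes per-example gradients are available as an array of shape (n, d), and the function names `simple_noise_scale` and `optimal_step_size` are hypothetical. It also subtracts tr(Σ)/n from the squared norm of the sample-mean gradient, because that squared norm overestimates |G|^2 by this amount on average; this correction is a choice made for the sketch.

```python
import numpy as np


def simple_noise_scale(per_example_grads):
    """Estimate B_simple = tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: array-like of shape (n, d), one flattened gradient
    per training example (an assumption of this sketch).
    """
    g = np.asarray(per_example_grads, dtype=float)
    n = g.shape[0]
    if n < 2:
        raise ValueError("need at least two per-example gradients")

    mean_grad = g.mean(axis=0)
    # tr(Sigma) is the sum of the per-coordinate variances (unbiased, ddof=1).
    trace_sigma = g.var(axis=0, ddof=1).sum()
    # E[|mean_grad|^2] = |G|^2 + tr(Sigma) / n, so remove the noise term.
    grad_sq_norm = mean_grad @ mean_grad - trace_sigma / n
    if grad_sq_norm <= 0:
        raise ValueError("gradient signal is below the noise floor; use more samples")
    return trace_sigma / grad_sq_norm


def optimal_step_size(batch_size, eps_max, b_noise):
    """eps_opt(B) = eps_max / (1 + B_noise / B)."""
    return eps_max / (1.0 + b_noise / batch_size)


if __name__ == "__main__":
    # Synthetic gradients: true G = 0.1 in each of 50 coordinates, Sigma = I,
    # so the true B_simple is 50 / 0.5 = 100. All values are made up.
    rng = np.random.default_rng(0)
    grads = 0.1 + rng.standard_normal((20_000, 50))
    b_simple = simple_noise_scale(grads)
    print(f"estimated B_simple: {b_simple:.1f}")
    for batch in (10, 100, 1000):
        eps = optimal_step_size(batch, eps_max=1.0, b_noise=b_simple)
        print(f"B={batch:5d}  eps_opt/eps_max={eps:.3f}")
```

At B = B_noise the optimal step size is exactly half of ε_max, which is one way to read the claim that returns diminish beyond the noise scale: batches much larger than B_noise buy almost no additional step size.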

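Experimental Results

Research Questions

RQ1: What is the gradient noise scale, and how does it relate to the optimal batch size across tasks?
RQ2: Do B_simple and B_noise predict the critical batch size beyond which compute-efficiency gains diminish?
RQ3: How does the noise scale evolve over training and across task types (supervised, reinforcement learning, generative)?
RQ4: How do the learning rate and the condition number affect the observed batch-size tradeoff?
RQ5: Does dynamic batch-size adjustment yield the efficiency gains the theory predicts?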

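Key Findings

The gradient noise scale predicts the largest useful batch size across tasks to roughly within an order of magnitude.
Training efficiency follows a Pareto frontier; beyond the noise scale, larger batches yield diminishing returns.
The noise scale increases during training as the model reaches lower loss.
B_simple gives a practical estimate of B_crit on many tasks, while in some cases B_noise gives a closer prediction.
More complex tasks (such as RL/Dota) have larger noise scales, which also grow as training progresses.
Dynamic batch-size tuning guided by the noise scale is predicted to improve efficiency.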
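
The Pareto frontier in the findings above can be made concrete with a short Python sketch. It assumes the hyperbola takes the form given in the original paper, (S/S_min - 1)(E/E_min - 1) = 1 with B_crit = E_min/S_min, where S is the number of optimization steps and E the number of training examples processed; this summary does not spell that form out. The function name `training_cost` and the numbers in the usage section are hypothetical.

```python
def training_cost(batch_size, s_min, e_min):
    """Steps S and examples E needed to reach a fixed loss at a given batch size.

    Assumes the tradeoff (S / s_min - 1) * (E / e_min - 1) = 1 with
    E = batch_size * S, which implies a critical batch size e_min / s_min.
    """
    b_crit = e_min / s_min
    steps = s_min * (1.0 + b_crit / batch_size)
    examples = e_min * (1.0 + batch_size / b_crit)
    return steps, examples


if __name__ == "__main__":
    # Made-up values: 1,000 steps minimum, 1,000,000 examples minimum,
    # so the critical batch size is 1,000.
    s_min, e_min = 1_000.0, 1_000_000.0
    for batch in (10, 100, 1_000, 10_000, 100_000):
        steps, examples = training_cost(batch, s_min, e_min)
        print(f"B={batch:7d}  S/S_min={steps / s_min:8.2f}  E/E_min={examples / e_min:8.2f}")
```

Under this model, training at B = B_crit costs twice the minimum number of steps and twice the minimum number of examples. Much smaller batches approach the compute-efficient limit at the cost of many more steps, and much larger batches approach the time-efficient limit at the cost of many more examples.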


This review was generated by AI and reviewed by a human editor.