QUICK REVIEW

[Paper Review] Coupling Adaptive Batch Sizes with Learning Rates

Lukas Balles, Javier Romero|arXiv (Cornell University)|Dec 15, 2016

Stochastic Gradient Optimization Techniques|References: 14|Cited by: 24

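One-Sentence Summary

This paper proposes CABS (Coupled Adaptive Batch Size), a method that dynamically adjusts the mini-batch size of stochastic gradient descent based on online estimates of the gradient variance and couples the batch size directly to the learning rate. The method reduces the variance of the gradient estimates as optimization progresses without requiring a decreasing learning rate schedule, which yields faster convergence on image classification benchmarks and lower sensitivity to learning rate tuning.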

ABSTRACT

Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

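Research Motivation and Objectives

Balance optimization stability and efficiency in stochastic gradient descent by dynamically adapting the batch size.
Couple the learning rate to the batch size via the gradient variance, removing the need for a manually tuned decreasing learning rate schedule.
Simplify hyperparameter tuning in deep learning by reducing sensitivity to the choice of learning rate.
Speed up training convergence while maintaining or improving generalization performance on standard benchmarks.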

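Proposed Method

CABS uses the current mini-batch to estimate the diagonal of the gradient covariance matrix (that is, the per-parameter variances) as an approximation of the true gradient variance.
Based on the theoretical relationship between variance, learning rate, and convergence, the batch size is adapted dynamically from the estimated gradient variance, the current objective function value, and the learning rate, so that the variance of the gradient estimate decreases in proportion to the objective value.
The method uses a closed-form expression to determine the batch size at each step, chosen to maximize the expected progress per unit of computational cost.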
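The batch size is coupled to the learning rate, reflecting the known relationship between the two: a larger learning rate calls for a larger batch size, and therefore a less noisy gradient estimate, which stabilizes optimization.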
The algorithm is implemented in TensorFlow and requires no additional hyperparameters beyond the initial learning rate.
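
The estimation and coupling steps above can be illustrated with a minimal NumPy sketch. It assumes the batch-size rule takes the form m = alpha * tr(Sigma) / F, where alpha is the learning rate, tr(Sigma) is the sum of the per-parameter gradient variances, and F is the current (non-negative) objective value. The function names and the small `eps` guard are hypothetical, and this is not the authors' TensorFlow implementation.

```python
import numpy as np


def estimate_gradient_variance(per_example_grads):
    """Estimate tr(Sigma): the sum of per-parameter sample variances.

    per_example_grads has shape (m, d): one gradient row per training
    example in the current mini-batch.
    """
    grads = np.asarray(per_example_grads, dtype=float)
    if grads.ndim != 2 or grads.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two examples")
    # Unbiased variance of each parameter's gradient, summed over parameters.
    return float(grads.var(axis=0, ddof=1).sum())


def cabs_batch_size(learning_rate, variance_trace, loss_value, eps=1e-12):
    """Raw, unrounded batch size under the assumed rule m = alpha * tr(Sigma) / F."""
    # eps (an assumption of this sketch) avoids division by zero near F = 0.
    return learning_rate * variance_trace / max(loss_value, eps)


# Usage: two per-example gradients for a two-parameter model.
grads = [[0.0, 0.0], [2.0, 2.0]]
trace = estimate_gradient_variance(grads)        # 4.0
suggested = cabs_batch_size(0.1, trace, 0.02)    # approximately 20
```

Under this assumed rule, doubling the learning rate doubles the suggested batch size, and the suggestion grows as the objective value shrinks relative to the gradient variance.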

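Experimental Results

Research Questions

RQ1: Does dynamic batch size adaptation based on online gradient variance estimates improve optimization convergence in deep learning?
RQ2: Does coupling the batch size to the learning rate remove the need for a decreasing learning rate schedule?
RQ3: Can CABS reduce the sensitivity of training performance to the choice of learning rate?
RQ4: How does CABS compare with fixed batch sizes and other adaptive batch size strategies in terms of convergence speed and final accuracy?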

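Key Findings

On the MNIST, SVHN, CIFAR-10, and CIFAR-100 benchmarks, CABS converges faster than constant batch size methods.
The method markedly reduces the dependence on learning rate tuning; in the learning rate sensitivity experiments it outperforms both constant and competing adaptive batch size schemes.
On all four benchmarks, CABS trains faster than non-adaptive large batch sizes (such as 128 and 512), even though its average batch size is smaller.
CABS uses the minimum batch size (16) for most of training and then increases it approximately linearly, adapting to the complexity of the problem.
The method maintains test accuracy comparable to all baselines while reducing the need for manual learning rate tuning.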
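
The finding that CABS stays at the minimum batch size for most of training is consistent with a floor applied to the real-valued suggestion. A minimal sketch follows, assuming the suggestion is rounded up and clipped to a fixed range. The floor of 16 comes from the findings above; the ceiling of 2048 and the function name are hypothetical choices made for illustration.

```python
import math


def clamp_batch_size(raw_batch_size, min_size=16, max_size=2048):
    """Round a real-valued batch-size suggestion up and clip it to [min_size, max_size].

    min_size=16 follows the minimum reported in the findings;
    max_size=2048 is a hypothetical cap for this sketch.
    """
    rounded = math.ceil(raw_batch_size)
    return int(min(max(rounded, min_size), max_size))


# Usage: small suggestions are held at the floor, large ones at the cap.
print(clamp_batch_size(3.7))    # 16
print(clamp_batch_size(20.2))   # 21
print(clamp_batch_size(1e6))    # 2048
```

While the raw suggestion stays below the floor, the effective batch size remains at 16, so growth becomes visible only once the suggestion exceeds that value.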

This review was generated by AI and checked by human editors.