QUICK REVIEW

[Paper Review] Three Factors Influencing Minima in SGD

Stanisław Jastrzębski, Zachary Kenton | arXiv (Cornell University) | Nov 13, 2017
References: 22 | Cited by: 249
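One-Sentence Summary

This paper shows that the ratio of learning rate to mini-batch size (LR/BS), together with the gradient covariance, determines the width of the minima found by SGD and their generalization performance. It analyzes this through a stochastic differential equation framework and validates it experimentally.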

ABSTRACT

We investigate the dynamical and convergent properties of stochastic gradient descent (SGD) applied to Deep Neural Networks (DNNs). Characterizing the relation between learning rate, batch size and the properties of the final minima, such as width or generalization, remains an open question. In order to tackle this problem we investigate the previously proposed approximation of SGD by a stochastic differential equation (SDE). We theoretically argue that three factors - learning rate, batch size and gradient covariance - influence the minima found by SGD. In particular we find that the ratio of learning rate to batch size is a key determinant of SGD dynamics and of the width of the final minima, and that higher values of the ratio lead to wider minima and often better generalization. We confirm these findings experimentally. Further, we include experiments which show that learning rate schedules can be replaced with batch size schedules and that the ratio of learning rate to batch size is an important factor influencing the memorization process.

Motivation and Objectives

  • Investigate how SGD dynamics and the geometry of final minima depend on LR, BS, and gradient covariance in deep nets.
  • Show that LR/BS ratio is the key determinant of minimum width and generalization.
  • Demonstrate invariance of SGD dynamics under rescaling of learning rate and batch size that preserves LR/BS.
  • Explore replacing learning rate schedules with batch size schedules without loss of performance.
  • Examine memorization dynamics and how LR/BS influences it.

Proposed Method

  • Model SGD as the Euler-Maruyama discretization of a stochastic differential equation whose noise variance is proportional to η/S (learning rate η, batch size S).
  • Derive a relation between LR/BS and the trace of the Hessian under a quadratic approximation of the loss near minima, where the dynamics reduce to an Ornstein-Uhlenbeck (OU) process.
  • Perform a change of variables using the Hessian/gradient covariance eigenstructure to analyze stationary distributions.
  • Empirically validate with architectures like VGG11 and ResNet on CIFAR-10 and with MLPs on Fashion-MNIST and CIFAR-10, measuring Hessian-related quantities and generalization.
  • Compare isotropic versus anisotropic gradient covariance scenarios to illustrate the LR/BS effect on equilibration and minima selection.
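
The first two steps above can be illustrated with a small NumPy simulation. This is a minimal sketch, not the authors' code: it assumes a diagonal quadratic loss, Gaussian mini-batch noise, and a gradient covariance equal to the Hessian, a simplification chosen here so that the stationary loss has the closed form η/(4S)·Tr(H). The function names `simulate_sgd_quadratic` and `predicted_loss`, the eigenvalues, and all default values are hypothetical.

```python
import numpy as np


def simulate_sgd_quadratic(eta, batch_size, hessian_eigs,
                           n_steps=2000, n_chains=5000, seed=0):
    """Mean stationary loss of noisy SGD on L(z) = 0.5 * sum(lam_i * z_i**2).

    Assumption (illustrative only): the per-sample gradient covariance equals
    the Hessian, so mini-batch gradient noise has covariance diag(lam) / S.
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(hessian_eigs, dtype=float)
    z = np.zeros((n_chains, lam.size))  # every chain starts at the minimum
    noise_scale = eta * np.sqrt(lam / batch_size)
    for _ in range(n_steps):
        eps = rng.standard_normal(z.shape)
        # One SGD step is one Euler-Maruyama step with dt = eta:
        # drift = -eta * gradient, diffusion = sqrt(eta/S) * sqrt(lam) * sqrt(dt) * eps
        z = z - eta * lam * z + noise_scale * eps
    # Average the loss over all independent chains at the final step.
    return float(np.mean(0.5 * np.sum(lam * z**2, axis=1)))


def predicted_loss(eta, batch_size, hessian_eigs):
    """Continuous-time (OU) prediction under the assumptions above."""
    return eta / (4.0 * batch_size) * float(np.sum(hessian_eigs))


if __name__ == "__main__":
    eigs = [0.5, 1.0, 2.0]  # hypothetical Hessian eigenvalues
    for eta, S in [(0.01, 8), (0.02, 16), (0.02, 8)]:
        sim = simulate_sgd_quadratic(eta, S, eigs)
        print(f"eta={eta}, S={S}: simulated={sim:.6f}, "
              f"predicted={predicted_loss(eta, S, eigs):.6f}")
```

Under these assumptions, runs with (η, S) = (0.01, 8) and (0.02, 16) should settle at nearly the same mean loss, while (0.02, 8) should settle at roughly twice that value. Small deviations from the prediction come from the finite step size and from sampling error.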

Experimental Results

Research Questions

  • RQ1: How do the SGD path and the final minima depend on learning rate, batch size, and gradient covariance?
  • RQ2: Are SGD dynamics determined primarily by the LR/BS ratio across different hyperparameter settings?
  • RQ3: Does increasing LR/BS lead to wider minima and improved generalization in DNNs?
  • RQ4: Can learning rate schedules be replaced by batch size schedules without sacrificing performance?
  • RQ5: How does LR/BS influence memorization and overfitting behavior during training?

Key Findings

  • SGD dynamics and the final minima are governed by the LR/BS ratio rather than LR or BS alone.
  • Higher LR/BS tends to yield wider minima and often better generalization.
  • SGD dynamics with the same LR/BS ratio are approximately the same when learning rate and batch size are rescaled together, because they correspond to the same underlying SDE/OU process.
  • Under a quadratic loss approximation, the expected loss at a minimum scales with η/S and the Hessian trace, linking noise level to minimum width.
  • Experiments show that larger LR/BS correlates with smaller Hessian eigenvalues and Frobenius norm, and with better validation performance.
  • Learning rate schedules can be effectively replaced by batch size schedules while preserving learning dynamics.
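
The last finding can be made concrete with a small conversion helper. This is a sketch under stated assumptions rather than the paper's procedure: it keeps the learning rate fixed at its initial value and grows the batch size so that LR/BS matches the original schedule in every epoch. The function name `batch_size_schedule` and the example values are hypothetical.

```python
def batch_size_schedule(lr_schedule, base_batch_size):
    """Convert per-epoch learning rates into per-epoch batch sizes.

    The learning rate stays at lr_schedule[0]; the batch size is scaled so
    that lr_schedule[0] / batch_size[t] == lr_schedule[t] / base_batch_size
    (up to rounding to an integer batch size).
    """
    base_lr = lr_schedule[0]
    return [max(1, round(base_batch_size * base_lr / lr)) for lr in lr_schedule]


if __name__ == "__main__":
    lrs = [0.1, 0.1, 0.01, 0.001]  # hypothetical step-decay schedule
    print(batch_size_schedule(lrs, base_batch_size=128))
```

In practice the batch size is capped by memory and by the dataset size, so a schedule like this can only follow a learning rate decay up to that limit.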

This review was generated by AI and reviewed by a human editor.