QUICK REVIEW

[Paper Review] The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent

X. Qian, Diego Klabjan|arXiv (Cornell University)|Apr 27, 2020

Stochastic Gradient Optimization Techniques | References: 35 | Cited by: 29

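One-sentence summary

This paper analyzes theoretically how the mini-batch size affects the variance of the stochastic gradient in linear models and two-layer linear networks. It proves that the gradient variance decreases as the batch size grows and that the variance is a polynomial in $1/b$ with no constant term, and it derives a recursive relationship between gradient norms and the initial weights, offering insight into SGD dynamics and generalization.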

ABSTRACT

The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models, in particular deep learning models. We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks, by focusing on the variance of the gradients, which is the first study of this nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size $b$ and thus the variance of the stochastic gradient estimator is a decreasing function of $b$. For deep neural networks with $L_2$ loss we show that the variance of the gradient is a polynomial in $1/b$. The results back the important intuition that smaller batch sizes yield lower loss function values, which is a common belief among researchers. The proof techniques exhibit a relationship between stochastic gradient estimators and initial weights, which is useful for further research on the dynamics of SGD. We empirically provide further insights to our results on various datasets and commonly used deep network structures.

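Motivation and objectives

Analyze theoretically how the mini-batch size affects gradient variance in stochastic gradient descent (SGD).
Establish that, in linear regression and two-layer linear networks, gradient variance decreases as the batch size increases.
Derive a recursive relationship between gradient norms and the initial model weights to support the theoretical analysis.
Provide a framework for understanding SGD dynamics that goes beyond convergence and focuses on variance and generalization.
Confirm the theoretical findings empirically across several datasets and network architectures.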

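Proposed method

Analyze gradient variance in linear regression theoretically, using norm properties of the gradient estimator.
Derive a recursive relationship between the gradient norm at each SGD iteration and the initial weights.
Prove that, for two-layer linear networks with $L_2$ loss, the gradient variance is a polynomial in $1/b$ whose leading coefficient is non-negative.
Use conditional variance and moment generating functions to characterize gradient behavior under random sampling.
Extend the results to deeper linear networks through the structural similarity of their gradient dynamics.
Validate empirically on synthetic data, MNIST, and the Yelp dataset, running each configuration multiple times for statistical reliability.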
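
To make the "polynomial in $1/b$ with no constant term" shape concrete, here is the standard one-step calculation at a fixed weight vector. It is a simplified illustration, not the paper's proof: it assumes the mini-batch indices $i_1, \dots, i_b$ are drawn independently and uniformly with replacement from $n$ samples, and the symbols $g_i$, $\bar g$, $\hat g_b$, and $\sigma^2$ are introduced here for the example only.

$$
\hat g_b = \frac{1}{b} \sum_{k=1}^{b} g_{i_k}, \qquad
\bar g = \frac{1}{n} \sum_{i=1}^{n} g_i, \qquad
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \lVert g_i - \bar g \rVert^2
$$

$$
\mathbb{E}\,\lVert \hat g_b - \bar g \rVert^2
= \frac{1}{b^2} \sum_{k=1}^{b} \mathbb{E}\,\lVert g_{i_k} - \bar g \rVert^2
= \frac{\sigma^2}{b}
$$

In this fixed-weight case the variance is a degree-one polynomial in $1/b$ with no constant term. In later SGD iterations the weights themselves depend on earlier mini-batches, which is why the paper's analysis has to track the dependence on the initial weights and why higher powers of $1/b$ can appear.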

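Experimental results

Research questions

RQ1: Does the variance of the stochastic gradient estimator in linear models decrease as the mini-batch size increases?
RQ2: In two-layer linear networks, what functional form does the gradient variance take as a function of the mini-batch size?
RQ3: How are the initial model weights related to the gradient norms across SGD iterations?
RQ4: For deep linear networks, can the variance of the gradient estimator be expressed as a polynomial in $1/b$?
RQ5: Do smaller mini-batch sizes lead to lower training loss because of their higher gradient variance?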

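Key findings

In linear regression, the norm of any linear combination of sample gradients is a decreasing function of the mini-batch size $b$.
For two-layer linear networks with $L_2$ loss and normally distributed inputs, the gradient variance is a polynomial in $1/b$ with no constant term, which shows that it decreases monotonically for large $b$.
The leading coefficient of the polynomial in $1/b$ is non-negative, which ensures that the variance decreases monotonically once the batch size is large enough.
The recursive relationship between gradient norms and the initial weights makes gradient-related quantities at any iteration computable from the initial conditions.
Empirical results on linear regression, two-layer networks, MNIST, and XLNet consistently show that smaller batch sizes lead to lower training loss and higher gradient variance.
The theoretical framework reveals structural dependencies among gradient variance, batch size, and initial weights, supporting future analysis of SGD dynamics.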
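
The trend for the linear regression case can be checked numerically. The following is a minimal sketch, not the authors' experimental code: the synthetic data, learning rate, step count, sampling without replacement, and the function names `make_data`, `minibatch_grad`, and `grad_variance_at_step` are all assumptions chosen for illustration. It runs many independent SGD trajectories from the same initial weights and measures the total variance (trace of the covariance) of the stochastic gradient at a fixed iteration for several batch sizes.

```python
import numpy as np


def make_data(n=200, d=5, seed=0):
    """Synthetic linear regression data with Gaussian inputs (illustrative)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y


def minibatch_grad(w, X, y, idx):
    """Gradient of 0.5 * mean squared error over the mini-batch `idx`."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)


def grad_variance_at_step(X, y, b, steps=5, lr=0.05, runs=500, seed=1):
    """Estimate the total variance of the stochastic gradient computed at the
    last of `steps` SGD iterations, across `runs` independent trajectories."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    grads = np.empty((runs, d))
    for r in range(runs):
        w = np.zeros(d)  # identical initial weights in every run
        for _ in range(steps):
            idx = rng.choice(n, size=b, replace=False)  # assumed sampling scheme
            g = minibatch_grad(w, X, y, idx)
            w = w - lr * g
        grads[r] = g  # gradient from the final iteration of this run
    # Sum of per-coordinate variances = trace of the covariance matrix.
    return float(np.sum(np.var(grads, axis=0)))


if __name__ == "__main__":
    X, y = make_data()
    for b in (2, 8, 32, 128):
        print(f"b={b:4d}  estimated gradient variance={grad_variance_at_step(X, y, b):.6f}")
```

Under these assumptions the printed variance should shrink as $b$ grows; the exact values depend on the seed and hyperparameters, and the sketch says nothing about the deeper-network or training-loss claims.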

This review was generated by AI and reviewed by a human editor.