QUICK REVIEW

[Paper Review] On the Origin of Implicit Regularization in Stochastic Gradient Descent

Samuel Smith, Benoît Dherin|arXiv (Cornell University)|Jan 28, 2021

Stochastic Gradient Optimization Techniques | References: 31 | Cited by: 40

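One-Sentence Summary

The paper shows that SGD with a small but finite learning rate follows the gradient flow of a modified loss, and it derives that modified loss through a backward error analysis adapted to the minibatch structure.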

ABSTRACT

For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However moderately large learning rates can achieve higher test accuracies, and this generalization benefit is not explained by convergence bounds, since the learning rate which maximizes test accuracy is often larger than the learning rate which minimizes training loss. To interpret this phenomenon we prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss. This modified loss is composed of the original loss function and an implicit regularizer, which penalizes the norms of the minibatch gradients. Under mild assumptions, when the batch size is small the scale of the implicit regularization term is proportional to the ratio of the learning rate to the batch size. We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.

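Motivation and Objectives

Explain the generalization benefit of SGD with finite learning rates, which convergence bounds do not account for.
Derive a modified loss for SGD that contains an implicit regularizer penalizing the norms of the minibatch gradients.
Explain how SGD and GD differ in terms of their implicit regularizers.
Verify empirically that including the implicit regularizer in the loss can improve test accuracy.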

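Proposed Method

Use a backward error analysis adapted to the minibatch structure to derive a modified loss for the mean SGD iterate after one epoch.
Show that the modified loss for SGD is C(ω) + (ε/(4m)) ∑_{k=0}^{m-1} ||∇Ĉ_k(ω)||^2, where Ĉ_k is the cost on the k-th minibatch, ε is the learning rate, and m is the number of minibatches per epoch.
Work out the relationship between the GD and SGD modified losses to compare the effects of the gradient and of the batch size.
Compute the expected SGD update after one epoch to identify the bias term introduced by the ordering of the minibatches.
Demonstrate, within the modified-loss framework, the linear scaling rule between learning rate and batch size.
Provide empirical evidence that explicitly including the implicit regularizer can improve test accuracy.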
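The modified loss above can be evaluated directly on a toy problem. The sketch below is illustrative only: the least-squares cost, the data, and every function name are assumptions made for this example, not taken from the paper. It also evaluates a full-batch counterpart, C(ω) + (ε/4)||∇C(ω)||^2, which this summary does not state and which is used here as an assumed form of the GD modified loss.

```python
# Illustrative sketch: toy least-squares cost, hypothetical names and data.

def example_loss(w, x, y):
    """Per-example cost 0.5 * (w.x - y)^2."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return 0.5 * residual ** 2


def example_grad(w, x, y):
    """Gradient of the per-example cost with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def mean_vectors(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)


def modified_losses(w, xs, ys, batch_size, lr):
    """Return (C, assumed GD modified loss, SGD modified loss) at w.

    Minibatches are consecutive slices of the data, so the batch size
    must divide the dataset size.
    """
    n = len(xs)
    assert n % batch_size == 0, "batch size must divide the dataset size"
    m = n // batch_size  # minibatches per epoch

    full_loss = sum(example_loss(w, x, y) for x, y in zip(xs, ys)) / n
    grads = [example_grad(w, x, y) for x, y in zip(xs, ys)]
    full_grad = mean_vectors(grads)
    batch_grads = [
        mean_vectors(grads[k * batch_size:(k + 1) * batch_size])
        for k in range(m)
    ]

    # Assumed full-batch form: (lr / 4) * ||grad C||^2
    gd_reg = lr / 4 * sq_norm(full_grad)
    # Form quoted in the text: (lr / (4 m)) * sum_k ||grad C_hat_k||^2
    sgd_reg = lr / (4 * m) * sum(sq_norm(g) for g in batch_grads)
    return full_loss, full_loss + gd_reg, full_loss + sgd_reg


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
    ys = [1.0, 2.0, 0.0, 1.0]
    print(modified_losses([0.5, -0.5], xs, ys, batch_size=2, lr=0.1))
```

Because the mean of squared norms is never smaller than the squared norm of the mean, the SGD value returned here is always at least as large as the full-batch one, and the two coincide when the batch size equals the dataset size.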

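Experimental Results

Research Questions

RQ1: Does SGD with a finite learning rate follow the gradient flow path of a modified loss?
RQ2: What form does the implicit regularizer take that arises from the minibatch structure of SGD?
RQ3: How does the implicit regularizer scale with the learning rate and the batch size?
RQ4: Can including the implicit regularizer in the training loss improve generalization?
RQ5: How do the modified losses of SGD and GD differ in their minima and trajectories?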

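Key Findings

The mean SGD iterate after one epoch stays close to the gradient flow path of the modified loss.
The modified loss for SGD is C(ω) + (ε/(4m)) ∑_{k=0}^{m-1} ||∇Ĉ_k(ω)||^2.
The implicit regularizer penalizes the mean squared norm of the minibatch gradients: the coefficient is ε/(4m) on the sum over the m minibatches, or equivalently ε/4 on their mean.
Provided the minibatch gradients are sufficiently diverse and the batch size B is small, the scale of the implicit regularizer is proportional to ε/B, which accounts for the effect of batch size.
Experiments show that explicitly adding the implicit regularizer to the loss and training with a small learning rate can improve test accuracy.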
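The dependence on batch size can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code: the per-example gradients are made-up numbers, and N (dataset size) and Γ (mean squared deviation of the per-example gradients from the full-batch gradient) are symbols introduced here. It averages the squared minibatch gradient norm over every possible batch of size B and compares the result with the standard sampling-without-replacement identity ||∇C||^2 + ((N - B)/(N - 1)) · Γ/B.

```python
from itertools import combinations

# Hypothetical per-example gradients at some fixed parameter value.
GRADS = [
    [1.0, 0.0], [0.0, 2.0], [-1.0, 1.0],
    [3.0, -1.0], [0.5, 0.5], [2.0, 2.0],
]


def brute_force_mean_sq_norm(grads, batch_size):
    """Average ||minibatch gradient||^2 over every subset of the given size."""
    total, count = 0.0, 0
    for batch in combinations(grads, batch_size):
        g = [sum(col) / batch_size for col in zip(*batch)]
        total += sum(gi * gi for gi in g)
        count += 1
    return total / count


def closed_form_mean_sq_norm(grads, batch_size):
    """||grad C||^2 + (N - B)/(N - 1) * Gamma / B, with Gamma as defined above."""
    n = len(grads)
    full = [sum(col) / n for col in zip(*grads)]
    gamma = sum(
        sum((gi - fi) ** 2 for gi, fi in zip(g, full)) for g in grads
    ) / n
    full_sq = sum(fi * fi for fi in full)
    return full_sq + (n - batch_size) / (n - 1) * gamma / batch_size


if __name__ == "__main__":
    lr = 0.1  # illustrative learning rate
    for b in (1, 2, 3, 6):
        brute = brute_force_mean_sq_norm(GRADS, b)
        closed = closed_form_mean_sq_norm(GRADS, b)
        # Expected regularizer is (lr / 4) times the mean squared norm.
        print(b, brute, closed, lr / 4 * brute)
```

Multiplying either quantity by ε/4 gives the expected regularizer. Its Γ term carries a factor of roughly ε/(4B) when B is much smaller than N, so scaling ε and B together leaves that term approximately unchanged, which is consistent with the linear scaling rule mentioned in the summary.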


This review was generated by AI and reviewed by a human editor.