QUICK REVIEW

[Paper Review] A Bayesian Perspective on Generalization and Stochastic Gradient Descent

Samuel Smith, Quoc V. Le | arXiv (Cornell University) | Oct 17, 2017

Stochastic Gradient Optimization Techniques | Cited by 24

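ONE-SENTENCE SUMMARY

This paper argues that the Bayesian evidence explains why stochastic gradient descent (SGD) generalizes well: the evidence penalizes sharp minima while remaining invariant to model parameterization. The authors derive the noise scale $ g \approx \epsilon N / B $ and show that there is an optimal mini-batch size $ B_{\text{opt}} \propto \epsilon N $ that maximizes test accuracy, a prediction they verify empirically across learning rates, mini-batch sizes, and training set sizes.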

ABSTRACT

We consider two questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work responds to Zhang et al. (2016), who showed deep neural networks can easily memorize randomly labeled training data, despite generalizing well on real labels of the same inputs. We show that the same phenomenon occurs in small linear models. These observations are explained by the Bayesian evidence, which penalizes sharp minima but is invariant to model parameterization. We also demonstrate that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy. We propose that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we identify the "noise scale" $g = ε(\frac{N}{B} - 1) \approx εN/B$, where $ε$ is the learning rate, $N$ the training set size and $B$ the batch size. Consequently the optimum batch size is proportional to both the learning rate and the size of the training set, $B_{opt} \propto εN$. We verify these predictions empirically.
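The approximation in the abstract's noise scale, and the step from it to the batch-size rule, can be written out explicitly. The following is a worked sketch rather than the paper's own derivation; $g_{\text{opt}}$ is a label introduced here for the noise scale at which test accuracy peaks, and the numbers in the final line are purely illustrative.

```latex
% Exact form from the abstract, factored to expose the correction term
g = \epsilon\left(\frac{N}{B} - 1\right)
  = \frac{\epsilon N}{B}\left(1 - \frac{B}{N}\right)
  \approx \frac{\epsilon N}{B}
  \qquad \text{when } B \ll N .

% If test accuracy peaks at some fixed noise scale g_opt (assumed label),
% solving the approximate form for B gives the linear scaling rule
B_{\text{opt}} \approx \frac{\epsilon N}{g_{\text{opt}}}
  \;\propto\; \epsilon N .

% Illustrative numbers only: epsilon = 0.1, N = 50000, B = 100
g = 0.1 \times (500 - 1) = 49.9
  \quad \text{versus} \quad
  \frac{\epsilon N}{B} = 50 .
```

The relative error of the approximation is $B/N$, so it is small whenever the mini-batch is a small fraction of the training set.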

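MOTIVATION AND OBJECTIVES

Explain why SGD finds minima that generalize well even though the same models can memorize random labels.
Resolve the paradox raised by Zhang et al. (2016): deep networks can memorize random labels, yet they generalize well on real data.
Clarify how mini-batch noise steers SGD toward minima with large Bayesian evidence.
Derive and validate the scaling rules that relate the optimal mini-batch size to the learning rate, training set size, and momentum.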

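PROPOSED METHOD

Use Bayesian model comparison to evaluate the model evidence, whose parameterization-invariant Occam factor penalizes sharp minima.
Model SGD as a stochastic differential equation with noise scale $ g \approx \epsilon N / B $, where $ \epsilon $ is the learning rate, $ N $ the training set size, and $ B $ the mini-batch size.
Derive the optimal mini-batch size $ B_{\text{opt}} \propto \epsilon N $ by balancing noise-driven exploration against convergence to high-evidence minima.
Extend the analysis to SGD with momentum, obtaining $ g \approx \frac{\epsilon N}{B(1 - m)} $ and $ B_{\text{opt}} \propto \frac{1}{1 - m} $, where $ m $ is the momentum coefficient.
Verify the scaling rules empirically across learning rates, mini-batch sizes, training set sizes, and momentum values.
Use a cross-entropy loss with L2 regularization, which corresponds to a Gaussian prior, to define the cost function $ C(\omega; M) = H(\omega; M) + \lambda \omega^2 / 2 $ and relate it to the posterior and the evidence.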
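To make the link between the cost function, the evidence, and the penalty on sharp minima concrete, here is a minimal Python sketch. It assumes the standard Laplace (second-order) approximation around a minimum $\omega_0$ with a Gaussian prior of precision $\lambda$, which gives $\ln P \approx -\left(C(\omega_0) + \frac{1}{2}\sum_i \ln(\lambda_i/\lambda)\right)$, where $\lambda_i$ are the eigenvalues of the Hessian of $C$ at $\omega_0$. This summary does not state that formula, so treat it as an assumption; the function name and the numbers are hypothetical.

```python
import math


def log_evidence_laplace(cost_at_min, hessian_eigs, lam):
    """Laplace estimate of the log evidence at a minimum (assumed form).

    cost_at_min  : C(w0), the regularized cost at the minimum
    hessian_eigs : eigenvalues of the Hessian of C at w0 (all positive)
    lam          : L2 coefficient, i.e. the precision of the Gaussian prior
    """
    if lam <= 0 or any(e <= 0 for e in hessian_eigs):
        raise ValueError("lam and all Hessian eigenvalues must be positive")
    # Log Occam term: large curvature (a sharp minimum) makes this bigger,
    # which lowers the evidence.
    log_occam = 0.5 * sum(math.log(e / lam) for e in hessian_eigs)
    return -(cost_at_min + log_occam)


# Two hypothetical minima with the same cost but different curvature.
broad = log_evidence_laplace(1.0, [2.0, 2.0], lam=1.0)
sharp = log_evidence_laplace(1.0, [200.0, 200.0], lam=1.0)
print(broad > sharp)  # the broad minimum has the larger evidence
```

Because the estimate depends on the curvature only through the ratios $\lambda_i/\lambda$, it compares curvature against the prior's scale rather than using raw Hessian eigenvalues, which is the sense in which the Occam factor is described as independent of parameterization.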

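EXPERIMENTAL RESULTS

Research Questions

RQ1: Why do SGD-trained models generalize well on real labels even though the models can memorize random labels?
RQ2: How does mini-batch noise in SGD influence which minima are selected and how well they generalize?
RQ3: What is the relationship between the optimal mini-batch size, the learning rate, and the training set size in SGD?
RQ4: How does momentum affect the optimal mini-batch size in SGD?
RQ5: Can the Bayesian evidence explain generalization in both deep models and small linear models?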

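Key Findings

The memorization of random labels that Zhang et al. (2016) observed in deep networks also occurs in small over-parameterized linear models.
The Bayesian evidence explains generalization: it penalizes sharp minima and is invariant to model parameterization, which resolves the memorization paradox.
There is an optimal mini-batch size that maximizes test accuracy, and it scales linearly with both the learning rate and the training set size: $ B_{\text{opt}} \propto \epsilon N $.
Experiments confirm the linear scaling rule $ B_{\text{opt}} \propto \epsilon N $; peak test accuracy stays stable up to $ \epsilon \sim 3 $, beyond which discretization error degrades performance.
For SGD with momentum, the optimal mini-batch size is proportional to $ \frac{1}{1 - m} $, and the experiments agree closely with this rule.
The optimal mini-batch size grows with the training set size, and the generalization gap shrinks as the dataset grows, which supports using larger mini-batches in production settings where the volume of data keeps increasing.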
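The scaling rules above can be turned into a small helper that picks a new mini-batch size when the learning rate, training set size, or momentum changes. This is a sketch under the approximate noise scale $ g \approx \frac{\epsilon N}{B(1 - m)} $, which assumes $ B \ll N $; the function names and example values are hypothetical and do not come from the paper.

```python
import math


def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g = lr * N / (B * (1 - m)), valid for B << N."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1)")
    return lr * train_size / (batch_size * (1.0 - momentum))


def rescale_batch_size(ref_batch, ref_lr, ref_train_size,
                       new_lr, new_train_size,
                       ref_momentum=0.0, new_momentum=0.0):
    """Batch size that keeps the noise scale equal to a tuned reference setup.

    Applies B_opt proportional to lr * N / (1 - m) and rounds to an integer.
    """
    if not (0.0 <= ref_momentum < 1.0 and 0.0 <= new_momentum < 1.0):
        raise ValueError("momentum must lie in [0, 1)")
    scale = (new_lr / ref_lr) \
        * (new_train_size / ref_train_size) \
        * ((1.0 - ref_momentum) / (1.0 - new_momentum))
    return max(1, round(ref_batch * scale))


# Hypothetical reference run: B = 128 tuned at lr = 0.1 on N = 50,000 examples.
doubled_lr = rescale_batch_size(128, 0.1, 50_000, 0.2, 50_000)
with_momentum = rescale_batch_size(128, 0.1, 50_000, 0.1, 50_000,
                                   new_momentum=0.5)
print(doubled_lr, with_momentum)
```

In this example, doubling the learning rate and raising the momentum from 0 to 0.5 each double the suggested batch size, leaving the noise scale unchanged. The helper is only a starting point for tuning: as noted above, the linear rule was observed to break down once the learning rate becomes large.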

This review was generated by AI and checked by a human editor.