QUICK REVIEW

[Paper Review] The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

Siyuan Ma, Raef Bassily|arXiv (Cornell University)|Dec 18, 2017

Stochastic Gradient Optimization Techniques|References: 20|Cited by: 38

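ONE-SENTENCE SUMMARY

This paper explains why mini-batch stochastic gradient descent (SGD) converges quickly for over-parametrized models that interpolate the training data. It identifies a critical batch size $m^*$: for $m \leq m^*$, the progress made by one SGD iteration grows linearly with the batch size, while for $m > m^*$ it saturates. Because $m^*$ is nearly independent of the dataset size $n$, SGD in the interpolation regime achieves an $O(n)$ computational speedup over full gradient descent.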

ABSTRACT

In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss (classification and regression) close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for fast convergence of SGD, comparable in number of iterations to full gradient descent. For convex loss functions we obtain an exponential convergence bound for mini-batch SGD parallel to that for full gradient descent. We show that there is a critical batch size $m^*$ such that: (a) SGD iteration with mini-batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of mini-batch size $1$ (linear scaling regime). (b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration (saturation regime). Moreover, for the quadratic loss, we derive explicit expressions for the optimal mini-batch and step size and explicitly characterize the two regimes above. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. We give experimental evidence on real data which closely follows our theoretical analyses. Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
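
The $O(n)$ claim in the abstract can be unpacked with a rough cost count. This is a back-of-the-envelope sketch, not the paper's derivation: cost is measured in per-example gradient evaluations, and $c$ is a hypothetical constant that absorbs the "nearly equivalent" in statement (b).

```latex
\begin{align*}
\text{cost of one full GD iteration} &= n \ \text{gradient evaluations},\\
\text{cost of one SGD iteration at } m = m^* &= m^* \ \text{gradient evaluations},\\
\text{progress of one SGD iteration at } m = m^* &\approx c \cdot (\text{progress of one GD iteration}),\\
\text{speedup per unit of computation} &\approx c \cdot \frac{n}{m^*} = O(n)
  \quad \text{if } m^* \text{ does not grow with } n.
\end{align*}
```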

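MOTIVATION AND OBJECTIVES

Explain the empirical success of mini-batch SGD in modern over-parametrized learning, where models interpolate the training data.
Analyze the convergence rate of mini-batch SGD in the interpolation regime, i.e. when the empirical loss is driven close to zero.
Identify a critical batch size $m^*$ that separates the linear scaling and saturation regimes of SGD efficiency.
Provide theoretical bounds on convergence speed and computational efficiency, showing that SGD can match full gradient descent in number of iterations.
Connect the theoretical findings to practice, such as the linear scaling rule used in deep learning.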

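PROPOSED METHOD

Analyzes convex loss functions under the interpolation assumption, i.e. the optimal solution attains zero training loss.
Derives an exponential convergence bound for mini-batch SGD that makes explicit its dependence on the batch size $m$ and the step size.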
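Identifies the critical batch size, which for the quadratic loss is $m^* = \frac{\beta}{\lambda_1} + 1$, where $\lambda_1$ is the largest eigenvalue of the Hessian and $\beta$ bounds the squared norm of the data points; it separates two regimes: linear scaling ($m \leq m^*$) and saturation ($m > m^*$).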
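Uses spectral analysis of the Hessian to characterize convergence rates in the over-parametrized setting, and discusses connections to variance reduction techniques.
For the quadratic loss, derives explicit expressions for the optimal batch size and step size.
Validates the theoretical predictions with experiments on the MNIST, TIMIT, and HINT-S datasets using kernel methods and deep learning.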
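
The setting described above can be illustrated on a small synthetic problem. The following is a minimal sketch, not the authors' code: it assumes the quadratic loss $\frac{1}{2n}\sum_i (x_i^\top w - y_i)^2$, random Gaussian data with more parameters than samples (so an interpolating solution exists), the quadratic-loss form $m^* = \beta/\lambda_1 + 1$, and a conservative step size $\eta = m / (\beta + (m-1)\lambda_1)$ chosen for this illustration. The function names are hypothetical.

```python
import numpy as np


def critical_batch_size(X):
    """Return (m_star, beta, lam1) for the loss 0.5 * mean((X w - y)^2).

    Assumes m_star = beta / lam1 + 1, with beta = max_i ||x_i||^2 and
    lam1 the largest eigenvalue of the Hessian H = X^T X / n.
    """
    n = X.shape[0]
    beta = float(np.max(np.sum(X ** 2, axis=1)))
    # X X^T / n has the same nonzero eigenvalues as H and is only n x n.
    lam1 = float(np.linalg.eigvalsh(X @ X.T / n)[-1])
    return beta / lam1 + 1.0, beta, lam1


def minibatch_sgd(X, y, m, eta, steps, rng):
    """Plain mini-batch SGD from w = 0; returns final w and the loss history."""
    n, d = X.shape
    w = np.zeros(d)
    losses = [0.5 * np.mean((X @ w - y) ** 2)]
    for _ in range(steps):
        idx = rng.integers(0, n, size=m)  # sample a batch with replacement
        Xb, yb = X[idx], y[idx]
        w -= eta * Xb.T @ (Xb @ w - yb) / m  # mini-batch gradient step
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return w, np.array(losses)


def make_interpolating_problem(n=100, d=400, seed=0):
    """Synthetic over-parametrized regression (d > n) with noiseless labels."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    w_true = rng.standard_normal(d)
    return X, X @ w_true, rng


if __name__ == "__main__":
    X, y, rng = make_interpolating_problem()
    m_star, beta, lam1 = critical_batch_size(X)
    m = 8
    eta = m / (beta + (m - 1) * lam1)  # illustrative step size (assumption)
    _, losses = minibatch_sgd(X, y, m, eta, steps=1000, rng=rng)
    print(f"m* ~ {m_star:.1f}, initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

Because the labels are noiseless and $d > n$, the stochastic gradient vanishes at the optimum, which is what allows a constant step size to drive the training loss toward zero instead of stalling at a noise floor.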

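EXPERIMENTAL RESULTS

Research questions

RQ1: Why does mini-batch SGD outperform full gradient descent in practice even though classical theory predicts a slower convergence rate?
RQ2: What role do over-parametrization and data interpolation play in the fast convergence of SGD?
RQ3: What is the critical batch size $m^*$ that marks the transition between the linear scaling and saturation regimes of SGD efficiency?
RQ4: In the interpolation regime, how does the computational efficiency of SGD vary with the mini-batch size?
RQ5: Can the linear scaling rule widely used in deep learning be theoretically justified in the interpolation setting?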

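Key findings

For convex losses in the interpolation regime, mini-batch SGD converges exponentially, with a number of iterations comparable to full gradient descent.
There is a critical batch size $m^*$ such that, for $m \leq m^*$, one SGD iteration with batch size $m$ is nearly equivalent to $m$ iterations with batch size 1 (the linear scaling regime).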
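For $m > m^*$, one SGD iteration is nearly equivalent to a full gradient descent iteration, so increasing the batch size further yields diminishing returns per unit of computation (the saturation regime).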
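The critical batch size $m^*$ is nearly independent of the dataset size $n$, which implies an $O(n)$ speedup over full gradient descent per unit of computation.
For the quadratic loss, explicit formulas for the optimal batch size and step size are derived, and they explicitly characterize the two regimes.
Experiments on MNIST, TIMIT, and HINT-S show training error curves that closely follow the theoretical predictions, with similar relative efficiency across kernels and datasets.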
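
The two regimes can be made concrete with a small calculation. This is a simplified sketch under assumptions that are not quoted from the paper: the quadratic loss, a step size $\eta(m) = m / (\beta + (m-1)\lambda_1)$, and per-iteration progress taken to be proportional to $\eta(m)$. Under those assumptions the speedup of batch size $m$ over batch size 1 is $m\beta / (\beta + (m-1)\lambda_1)$, which grows almost linearly for small $m$ and levels off near $\beta/\lambda_1$. The function names are hypothetical.

```python
def step_size(m, beta, lam1):
    """Illustrative step size for batch size m (an assumption of this sketch)."""
    return m / (beta + (m - 1) * lam1)


def speedup_over_batch_one(m, beta, lam1):
    """Per-iteration progress of batch size m relative to batch size 1.

    Progress is modeled as proportional to the step size, so the ratio
    simplifies to m * beta / (beta + (m - 1) * lam1).
    """
    return step_size(m, beta, lam1) / step_size(1, beta, lam1)


if __name__ == "__main__":
    beta, lam1 = 1.0, 0.01          # hypothetical values
    m_star = beta / lam1 + 1.0      # critical batch size under the assumed form
    print(f"m* = {m_star:.0f}")
    for m in (1, 2, 8, 32, 101, 1000, 100000):
        s = speedup_over_batch_one(m, beta, lam1)
        print(f"m = {m:>6}: speedup per iteration = {s:7.2f}, per gradient = {s / m:.3f}")
```

In this model the per-iteration speedup at $m = m^*$ is $m^*/2$, and it can never exceed $\beta/\lambda_1 = m^* - 1$, so batches larger than $m^*$ buy at most another factor of two per iteration while costing proportionally more gradient evaluations.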

This review was generated by AI and reviewed by a human editor.