QUICK REVIEW

[Paper Review] SGD and Hogwild! Convergence Without the Bounded Gradients Assumption

Lam M. Nguyen, Phuong Ha Nguyen|arXiv (Cornell University)|Feb 11, 2018

Stochastic processes and financial applications | Cited by 39

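One-Sentence Summary

This paper establishes convergence of SGD and Hogwild! for strongly convex objectives without assuming that stochastic gradients are uniformly bounded, a standard assumption that generally fails in the strongly convex setting. By exploiting the structure of machine learning problems, in which each individual stochastic gradient is Lipschitz continuous and the overall objective is strongly convex, the authors derive new convergence rates for diminishing step sizes and prove that $\mathbb{E}[\|\hat{w}_{t+1}-w_{*}\|^{2}] \leq \frac{4\alpha^{2}DN}{\mu^{2}}\frac{1}{t} + O\left(\frac{1}{t\ln t}\right)$, the first result of this kind for Hogwild! with diminishing step sizes.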

ABSTRACT

Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical convergence analysis of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al., 2016), a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such bound always holds; and we also propose an alternative convergence analysis of SGD with diminishing learning rate regime, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting, and prove convergence of Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of diminished learning rate.

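Motivation and Objectives

To address a limitation of the classical convergence analysis of SGD, which relies on the unrealistic assumption that stochastic gradients are uniformly bounded.
To show that this bounded-gradient assumption inherently fails for strongly convex problems such as regularized least squares and logistic regression.
To establish convergence of SGD and Hogwild! under more realistic assumptions: each individual stochastic function is convex with a Lipschitz continuous gradient, and the overall objective is strongly convex.
To derive convergence rates for both methods in the diminishing step size regime without relying on bounded gradients.
To provide the first convergence analysis of Hogwild! with a diminishing learning rate, extending its theoretical foundation to settings used in practice.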
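
The incompatibility between uniformly bounded gradients and strong convexity can be checked in a few lines. The sketch below assumes that $F$ is differentiable and $\mu$-strongly convex on all of $\mathbb{R}^d$ with minimizer $w_*$, and that the stochastic gradient is unbiased, $\mathbb{E}[\nabla f(w;\xi)] = \nabla F(w)$. The symbol $G$ for the hypothetical uniform bound is notation introduced here, not taken from this review.

```latex
% Strong convexity with \nabla F(w_*) = 0, followed by Cauchy--Schwarz:
\mu \|w - w_*\|^2 \le \langle \nabla F(w) - \nabla F(w_*),\, w - w_* \rangle
                 \le \|\nabla F(w)\| \, \|w - w_*\|
\;\Longrightarrow\; \|\nabla F(w)\| \ge \mu \|w - w_*\|.

% Jensen's inequality applied to the assumed bound E\|\nabla f(w;\xi)\|^2 \le G^2:
\|\nabla F(w)\|^2 = \big\|\mathbb{E}[\nabla f(w;\xi)]\big\|^2
                  \le \mathbb{E}\big[\|\nabla f(w;\xi)\|^2\big] \le G^2.

% Combining the two inequalities:
\|w - w_*\| \le \frac{G}{\mu} \quad \text{for every } w \in \mathbb{R}^d.
```

The last line fails for any $w$ farther than $G/\mu$ from $w_*$, so under these assumptions such a bound can hold only on a bounded region, never uniformly.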

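Proposed Method

The authors analyze SGD and Hogwild! under the assumptions that each $f(w;\xi)$ is convex with a Lipschitz continuous gradient and that the expected objective $F(w)$ is $\mu$-strongly convex.
They propose a new analysis framework that avoids the uniformly bounded gradient assumption by exploiting the structure of machine learning problems, in which the stochastic gradient is naturally bounded relative to the true gradient norm.
For Hogwild!, they model asynchronous updates with bounded delay $\tau(t)$ and derive a bound on the expected squared distance to the optimum $w_*$.
They use a recursive bound on $\mathbb{E}[\|\hat{w}_t - w_*\|^2]$ that combines terms for gradient noise, delay, and step size decay.
The analysis introduces a time-varying delay $\tau(t) \leq \sqrt{t \cdot L(t)}$, where $L(t) = \frac{1}{\ln t} - \frac{1}{(\ln t)^2}$, to control the growth of accumulated error.
They derive a key lemma showing that, with diminishing step size $\eta_t = \frac{\alpha_t}{\mu(t + 2\tau(t))}$ and $\alpha_t \in [12, \alpha]$, the expected error decays at rate $O(1/t)$, with a lower-order $O(1/(t \ln t))$ term.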
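
To make the structural claim concrete, here is a sketch of how a bound on the stochastic gradient relative to the objective gap can follow from the stated assumptions. It assumes each $f(\cdot;\xi)$ is convex with an $L$-Lipschitz gradient, where $L$ denotes the Lipschitz constant and not the delay function $L(t)$. Reading $N$ as $2\,\mathbb{E}[\|\nabla f(w_*;\xi)\|^2]$ is an assumption about what the constant $N$ in the quoted rate denotes; this review does not define it.

```latex
% Step 1: \|a\|^2 \le 2\|a - b\|^2 + 2\|b\|^2 with a = \nabla f(w;\xi), b = \nabla f(w_*;\xi)
\|\nabla f(w;\xi)\|^2
  \le 2\,\|\nabla f(w;\xi) - \nabla f(w_*;\xi)\|^2 + 2\,\|\nabla f(w_*;\xi)\|^2.

% Step 2: convexity plus L-Lipschitz gradient of f(\cdot;\xi) (co-coercivity)
\|\nabla f(w;\xi) - \nabla f(w_*;\xi)\|^2
  \le 2L\,\big[f(w;\xi) - f(w_*;\xi) - \langle \nabla f(w_*;\xi),\, w - w_* \rangle\big].

% Step 3: take expectations, using \mathbb{E}[\nabla f(w_*;\xi)] = \nabla F(w_*) = 0
\mathbb{E}\big[\|\nabla f(w;\xi)\|^2\big] \le 4L\,[F(w) - F(w_*)] + N,
\qquad N := 2\,\mathbb{E}\big[\|\nabla f(w_*;\xi)\|^2\big].
```

The right-hand side grows with the objective gap $F(w) - F(w_*)$ instead of being a fixed constant, which is the kind of bound that remains compatible with strong convexity.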

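Experimental Results

Research Questions

RQ1: Can SGD converge when the classical assumption of uniformly bounded stochastic gradients fails, particularly for strongly convex problems?
RQ2: Does Hogwild! converge under a diminishing learning rate schedule, given that earlier analyses required constant or polylogarithmic step sizes?
RQ3: What are the convergence rates of SGD and Hogwild! when stochastic gradients are not uniformly bounded but the overall objective is strongly convex?
RQ4: How do bounded update delays in the asynchronous setting affect convergence, and can the effect be quantified in expectation?
RQ5: Can the analysis be extended to nonconvex individual functions $f(w;\xi)$ while preserving convergence, provided $F(w)$ is strongly convex?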

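Key Findings

The paper proves that the classical assumption of uniformly bounded stochastic gradients is incompatible with strong convexity, because it contradicts the growth of the objective function.
For SGD, the expected squared error $\mathbb{E}[\|\hat{w}_{t+1} - w_*\|^2]$ decays at rate $\frac{4\alpha^2DN}{\mu^2} \cdot \frac{1}{t} + O\left(\frac{1}{t\ln t}\right)$, the first result of this kind without a bounded-gradient assumption.
The analysis confirms that the error bound holds with diminishing step size $\eta_t = \frac{\alpha_t}{\mu(t + 2\tau(t))}$ and $\alpha_t \in [12, \alpha]$, ensuring convergence even though the gradients are not uniformly bounded.
For Hogwild!, the authors establish the first convergence result with a diminishing learning rate under the same assumptions, by modeling asynchronous updates with time-varying delay $\tau(t) \leq \sqrt{t \cdot L(t)}$.
The derived rate is robust to delay and noise: the dominant error term decays as $O(1/t)$, and the lower-order $O(1/(t\ln t))$ term becomes negligible for large $t$.
The analysis shows that the $O(1/t)$ term dominates once $t \geq \exp\left[2\sqrt{\Delta}\left(1 + \frac{(L+\mu)\alpha}{\mu}\right)\right]$, confirming the asymptotic convergence rate.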
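
The interplay of the delay bound, the diminishing step size, and the $O(1/t)$ decay can be tried out numerically. The following is a minimal sketch and not the authors' algorithm: it emulates stale reads serially instead of running lock-free threads, and it uses a toy quadratic objective chosen here for illustration. The function names `delay_bound` and `delayed_sgd`, the curvatures `a`, the minimizer `m`, the constant choice $\alpha_t = 12$, and the fixed cap `tau = 8` used in place of $\tau(t)$ in the step size are all assumptions.

```python
import math
import random


def delay_bound(t):
    """Upper bound sqrt(t * L(t)) on the delay, with L(t) = 1/ln t - 1/(ln t)^2."""
    if t <= math.e:
        # L(t) <= 0 in this range, so no positive delay is allowed.
        return 0.0
    log_t = math.log(t)
    return math.sqrt(t * (1.0 / log_t - 1.0 / log_t ** 2))


def delayed_sgd(steps=20000, tau=8, alpha=12.0, seed=0):
    """Serial emulation of SGD with stale reads on a toy strongly convex problem.

    Toy objective (an assumption for illustration, not from the paper):
        f(w; xi) = 0.5 * sum_j a_j * (w_j - xi_j)^2,  xi = m + Gaussian noise,
    so F(w) = E[f(w; xi)] is mu-strongly convex with minimizer w_* = m,
    and the stochastic gradient a * (w - xi) is unbounded in w.
    Returns the squared distance ||w_t - w_*||^2 after each step.
    """
    rng = random.Random(seed)
    a = [1.0, 1.5, 2.0]            # per-coordinate curvatures: mu = 1, L = 2
    m = [1.0, -2.0, 0.5]           # minimizer w_*
    mu = min(a)
    w = [10.0, 10.0, 10.0]         # start far from w_*
    history = [list(w)]            # recent iterates, used for stale reads
    errors = []
    for t in range(1, steps + 1):
        # Delay drawn uniformly from 0..max_delay, capped by tau and sqrt(t * L(t)).
        max_delay = min(tau, int(delay_bound(t)), len(history) - 1)
        stale = history[-1 - rng.randint(0, max_delay)]
        xi = [mj + rng.gauss(0.0, 1.0) for mj in m]
        # Stochastic gradient evaluated at the stale iterate.
        grad = [aj * (sj - xj) for aj, sj, xj in zip(a, stale, xi)]
        # Diminishing step size alpha / (mu * (t + 2 * tau)) with constant tau.
        eta = alpha / (mu * (t + 2 * tau))
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
        history.append(list(w))
        if len(history) > tau + 1:
            history.pop(0)         # keep only the last tau + 1 iterates
        errors.append(sum((wj - mj) ** 2 for wj, mj in zip(w, m)))
    return errors


if __name__ == "__main__":
    errs = delayed_sgd()
    for t in (100, 1000, 10000, 20000):
        print(t, errs[t - 1], t * errs[t - 1])
```

`delayed_sgd()` returns the squared distance to $w_*$ at every step. If the $O(1/t)$ behavior carries over to this toy setting, the product `t * errs[t - 1]` should stay bounded as `t` grows instead of drifting upward. A single run is noisy, so averaging over several seeds gives a clearer picture.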

This review was generated by AI and reviewed by a human editor.