QUICK REVIEW

[Paper Review] Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization

Zhouyuan Huo, Heng Huang|arXiv (Cornell University)|Apr 12, 2016

Stochastic Gradient Optimization Techniques | References: 23 | Cited by: 23

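One-Sentence Summary

This paper gives the first theoretical convergence analysis of asynchronous stochastic gradient descent with variance reduction (AsySVRG) for non-convex optimization. It proves that AsySVRG attains an $O(1/T)$ convergence rate on both shared-memory and distributed-memory architectures, that linear speedup is achievable as long as the number of workers is upper bounded, and that variance reduction yields a faster rate than standard asynchronous SGD.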

ABSTRACT

We provide the first theoretical analysis on the convergence rate of the asynchronous stochastic variance reduced gradient (SVRG) descent algorithm on non-convex optimization. Recent studies have shown that the asynchronous stochastic gradient descent (SGD) based algorithms with variance reduction converge with a linear convergent rate on convex problems. However, there is no work to analyze asynchronous SGD with variance reduction technique on non-convex problem. In this paper, we study two asynchronous parallel implementations of SVRG: one is on a distributed memory system and the other is on a shared memory system. We provide the theoretical analysis that both algorithms can obtain a convergence rate of $O(1/T)$, and linear speed up is achievable if the number of workers is upper bounded. V1,v2,v3 have been withdrawn due to reference issue, please refer the newest version v4.

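Motivation and Objectives

Fill the gap in the theoretical understanding of asynchronous SVRG on non-convex problems; prior work covered only the convex setting.
Analyze the convergence behavior of AsySVRG on two parallel architectures: shared-memory and distributed-memory systems.
Show that, in the non-convex setting, variance reduction gives a faster convergence rate than standard asynchronous SGD.
Show that linear speedup in the number of workers is achievable on both architectures.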

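Proposed Method

Proposes two asynchronous SVRG variants: one for shared memory (atomic updates per coordinate) and one for distributed memory (atomic updates per vector).
Uses a weighted average of the squared gradient norm, $\mathbb{E}[\|\nabla f(x)\|^2]$, as the convergence measure for non-convex problems.
Uses a recurrence-based analysis that introduces coefficients $c_t$ and $\Gamma_t$ to control the effects of variance and delay.
Imposes standard assumptions: unbiased gradients, $L$-smoothness, and a bounded time delay $\Delta$.
Derives the convergence bound by analyzing the expected per-iteration decrease in the objective value under the SVRG update rule with delayed gradients.
Uses a constant learning rate $\eta_t = \eta = \frac{u_0 b}{L n^\alpha}$ with $0 < \alpha < 1$, and sets the number of inner iterations per epoch to $m = \lfloor n^\alpha / (6 u_0 b) \rfloor$.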
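The update rule and parameter choices listed above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: asynchrony is imitated by evaluating each stochastic gradient at an iterate that is up to `max_delay` steps old (each read sees a whole, consistent vector, which is closer in spirit to the distributed-memory variant), the sigmoid least-squares loss and the synthetic data are hypothetical stand-ins for a non-convex objective, and `b` is read as the mini-batch size and `u0` as a small positive constant, neither of which this summary defines. The step size and epoch length used in the demo are hand-picked for a toy problem rather than derived from the formulas.

```python
import math
import random


def step_size_and_epoch_length(n, L, alpha, u0, b):
    """Evaluate eta = u0*b / (L*n^alpha) and m = floor(n^alpha / (6*u0*b))."""
    n_alpha = n ** alpha
    return u0 * b / (L * n_alpha), math.floor(n_alpha / (6 * u0 * b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_i(w, a, y):
    """Gradient of the (hypothetical) non-convex loss f_i(w) = (sigmoid(a.w) - y)^2."""
    s = sigmoid(sum(aj * wj for aj, wj in zip(a, w)))
    coeff = 2.0 * (s - y) * s * (1.0 - s)
    return [coeff * aj for aj in a]


def full_grad(w, data):
    """Average of grad_i over the whole data set."""
    g = [0.0] * len(w)
    for a, y in data:
        for j, gj in enumerate(grad_i(w, a, y)):
            g[j] += gj / len(data)
    return g


def grad_norm_sq(w, data):
    """Squared norm of the full gradient, the convergence measure used above."""
    return sum(gj * gj for gj in full_grad(w, data))


def delayed_svrg(data, w0, eta, m, epochs, max_delay, seed=0):
    """SVRG in which each stochastic gradient is read at a stale iterate."""
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(epochs):
        snapshot = list(w)
        mu = full_grad(snapshot, data)      # full gradient at the snapshot
        history = [list(w)]                 # recent iterates of this epoch
        for _ in range(m):
            # Simulated staleness: pick an iterate at most max_delay steps old.
            tau = rng.randint(0, min(max_delay, len(history) - 1))
            stale = history[-1 - tau]
            a, y = data[rng.randrange(len(data))]
            g_stale = grad_i(stale, a, y)
            g_snap = grad_i(snapshot, a, y)
            # Variance-reduced direction applied to the current iterate.
            w = [wj - eta * (gs - gn + mj)
                 for wj, gs, gn, mj in zip(w, g_stale, g_snap, mu)]
            history.append(list(w))
            history = history[-(max_delay + 1):]
    return w


def make_data(n, w_true, seed=1):
    """Synthetic, noise-free data generated from a known weight vector."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        a = [rng.gauss(0.0, 1.0) for _ in w_true]
        data.append((a, sigmoid(sum(aj * wj for aj, wj in zip(a, w_true)))))
    return data


data = make_data(200, [1.0, -1.0, 0.5, -0.5, 0.25])
w0 = [0.0] * 5
w = delayed_svrg(data, w0, eta=0.1, m=50, epochs=30, max_delay=4)
print("squared gradient norm before/after:",
      grad_norm_sq(w0, data), grad_norm_sq(w, data))
```

Setting `max_delay=0` recovers ordinary sequential SVRG, which makes it easy to compare runs with and without staleness on the same seed. `step_size_and_epoch_length` only evaluates the two formulas; using it on a real problem would require an estimate of the smoothness constant `L`.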

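Results

Research Questions

RQ1: In non-convex optimization, can asynchronous SVRG achieve a faster convergence rate than standard asynchronous SGD?
RQ2: Can AsySVRG maintain linear convergence on non-convex problems under shared-memory and distributed-memory architectures?
RQ3: In the non-convex setting, can asynchronous SVRG achieve linear speedup as the number of workers increases?
RQ4: For non-convex objectives, how do gradient delay and variance affect the convergence of asynchronous SVRG?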

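Key Findings

AsySVRG attains an $O(1/T)$ convergence rate on smooth non-convex problems under both shared-memory and distributed-memory architectures.
This rate improves on the $O(1/\sqrt{T})$ rate of standard asynchronous SGD in the non-convex setting.
Linear speedup in the number of workers is rigorously proven, provided the delay $\Delta$ is bounded.
The analysis shows that the method remains stable and convergent with delayed gradients as long as $\Delta^2$ is bounded.
The theoretical bound depends on a small positive constant $\sigma$, which exists when the delay and learning rate are sufficiently controlled.
Experiments on MNIST and CIFAR-10 support the theoretical results, showing faster convergence and good scalability.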
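To make the gap between the two rates concrete, the short derivation below converts each rate into the number of iterations needed to reach a target accuracy $\epsilon$ on the averaged squared gradient norm. It is a simplified illustration: the constants $C_1$ and $C_2$ are placeholders that are not taken from the paper, and a uniform average stands in for the weighted average used in the analysis.

```latex
% AsySVRG-type rate: O(1/T)
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(x_t)\|^2 \le \frac{C_1}{T}
\quad\Longrightarrow\quad
T \ge \frac{C_1}{\epsilon} \text{ suffices for accuracy } \epsilon .

% Asynchronous SGD-type rate: O(1/\sqrt{T})
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla f(x_t)\|^2 \le \frac{C_2}{\sqrt{T}}
\quad\Longrightarrow\quad
T \ge \frac{C_2^2}{\epsilon^2} \text{ suffices for accuracy } \epsilon .

% Example with C_1 = C_2 = 1 and \epsilon = 10^{-2}:
% T \ge 10^{2} for the O(1/T) bound, versus T \ge 10^{4} for the O(1/\sqrt{T}) bound.
```

These are sufficient iteration counts implied by the upper bounds, so they compare guarantees rather than the observed behavior of either method.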

This review was generated by AI and reviewed by a human editor.