QUICK REVIEW

[Paper Review] Variance Reduction for Faster Non-Convex Optimization

Zeyuan Allen-Zhu, Elad Hazan | arXiv (Cornell University) | Mar 17, 2016

Stochastic Gradient Optimization Techniques | References: 25 | Cited by: 126

One-Sentence Summary

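This paper proposes a variance-reduced stochastic method for non-convex optimization that reaches an $\varepsilon$-stationary point in $O(n^{2/3}/\varepsilon)$ iterations, improving on both gradient descent and SGD without additional assumptions.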

ABSTRACT

We consider the fundamental problem in non-convex optimization of efficiently reaching a stationary point. In contrast to the convex case, in the long history of this basic problem, the only known theoretical results on first-order non-convex optimization remain to be full gradient descent that converges in $O(1/\varepsilon)$ iterations for smooth objectives, and stochastic gradient descent that converges in $O(1/\varepsilon^2)$ iterations for objectives that are sum of smooth functions. We provide the first improvement in this line of research. Our result is based on the variance reduction trick recently introduced to convex optimization, as well as a brand new analysis of variance reduction that is suitable for non-convex optimization. For objectives that are sum of smooth functions, our first-order minibatch stochastic method converges with an $O(1/\varepsilon)$ rate, and is faster than full gradient descent by $\Omega(n^{1/3})$. We demonstrate the effectiveness of our methods on empirical risk minimizations with non-convex loss functions and training neural nets.

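Motivation and Objectives

Motivation: efficiently find stationary points of non-convex objectives.
Propose a variance reduction method suited to non-convex losses that improves on GD and SGD.
Develop an SVRG-based algorithm with a provable $O(n^{2/3} L (f(x_0) - f(x^*)) / \varepsilon)$ convergence rate.
Extend variance reduction to the non-convex setting and analyze the resulting variance upper bound.
Demonstrate empirical effectiveness on ERM with non-convex losses and on neural networks.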
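The setting behind these objectives can be written out in symbols. This is a sketch under two assumptions not spelled out in the summary: every component $f_i$ is $L$-smooth, and "$\varepsilon$-stationary" means the expected squared gradient norm is at most $\varepsilon$ (the convention under which the bound on $\mathbb{E}[\|\nabla f(x)\|^2]$ and the $O(1/\varepsilon)$-type rates quoted in this review line up).

```latex
\[
  \min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x),
  \qquad
  \|\nabla f_i(x) - \nabla f_i(y)\| \le L \, \|x - y\|
  \quad \text{for all } x, y \text{ and all } i,
\]
\[
  \text{goal: output } x \text{ such that } \;
  \mathbb{E}\big[\|\nabla f(x)\|^2\big] \le \varepsilon .
\]
```

No convexity is assumed for $f$ or for any $f_i$; only smoothness is used.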

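Proposed Method

Use an SVRG-style variance-reduced gradient estimator, adapted to non-convex objectives.
Use an epoch structure with a snapshot point $x_0^s$ and inner iterations to form $\widetilde{\nabla}_k = \nabla f_i(x_k^s) - \nabla f_i(x_0^s) + \nabla f(x_0^s)$.
Set the inner-loop length to $m = n$ and the step size to $\eta = \Theta(1/(n^{2/3}L))$.
Divide each epoch into sub-epochs so that the variance bounds can be collapsed (telescoped), using a mirror-descent-style analysis.
Prove that the variance of the gradient estimator is bounded by $O(\|x_k^s - x_0^s\|^2)$, and relate this bound to the decrease in the objective.
Present algorithm variants and discuss mini-batching, non-uniform smoothness, and the extension to sum-of-non-convex objectives.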
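The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the toy objective (`make_toy_problem`, least squares plus the non-convex regularizer $t^2/(1+t^2)$) is hypothetical and chosen only so that each component is smooth and non-convex; the constant inside $\Theta(\cdot)$ is taken to be 1; and the loop returns the last iterate and omits the sub-epoch structure, which belongs to the analysis and output rule rather than to the basic update.

```python
import numpy as np


def make_toy_problem(n=200, d=10, lam=1.0, seed=0):
    """Hypothetical toy ERM (an assumption, not from the paper):
    f_i(x) = 0.5 * (a_i . x - b_i)^2 + lam * sum_j x_j^2 / (1 + x_j^2)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    def reg_grad(x):
        # d/dt [t^2 / (1 + t^2)] = 2t / (1 + t^2)^2, applied coordinate-wise
        return lam * 2.0 * x / (1.0 + x ** 2) ** 2

    def grad_i(x, i):
        # gradient of the i-th component f_i
        return A[i] * (A[i] @ x - b[i]) + reg_grad(x)

    def full_grad(x):
        # gradient of f = (1/n) * sum_i f_i
        return A.T @ (A @ x - b) / n + reg_grad(x)

    # The regularizer's second derivative is bounded by 2 in absolute value,
    # so every f_i is L-smooth with L = max_i ||a_i||^2 + 2 * lam.
    L = float(np.max(np.sum(A ** 2, axis=1)) + 2.0 * lam)
    return grad_i, full_grad, L


def svrg_nonconvex(grad_i, full_grad, x0, n, L, epochs, seed=0):
    """SVRG-style loop with m = n inner steps and eta = 1 / (n^(2/3) * L)."""
    rng = np.random.default_rng(seed)
    m = n                                # inner-loop length from the summary
    eta = 1.0 / (n ** (2.0 / 3.0) * L)   # Theta(.) constant set to 1 (assumption)
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        snapshot = x.copy()              # snapshot point x_0^s
        mu = full_grad(snapshot)         # one full gradient per epoch
        for _ in range(m):
            i = rng.integers(n)          # uniform random component index
            # variance-reduced estimator: grad f_i(x_k) - grad f_i(x_0) + grad f(x_0)
            g = grad_i(x, i) - grad_i(snapshot, i) + mu
            x = x - eta * g
    return x                             # last iterate (simplified output rule)


n, d = 200, 10
grad_i, full_grad, L = make_toy_problem(n=n, d=d)
x0 = np.zeros(d)
x_out = svrg_nonconvex(grad_i, full_grad, x0, n=n, L=L, epochs=50)
print("squared gradient norm at start:", float(np.sum(full_grad(x0) ** 2)))
print("squared gradient norm at output:", float(np.sum(full_grad(x_out) ** 2)))
```

Each epoch spends $n$ component gradients on the snapshot and $2n$ on the inner loop, so the per-iteration cost stays within a constant factor of SGD.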

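Experimental Results

Research Questions

RQ1: Can variance reduction reach an $\varepsilon$-stationary point faster than GD/SGD in non-convex optimization?
RQ2: For non-convex objectives, which variance upper bounds and analysis techniques are suited to achieving this speedup?
RQ3: How should SVRG be adapted (snapshot choice, epoch/sub-epoch structure) to non-convex losses?
RQ4: Do these methods carry over in practice to ERM with non-convex losses and to neural networks?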

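Key Findings

The proposed non-convex SVRG variant reaches an $\varepsilon$-stationary point within $O(n^{2/3} L (f(x_0) - f(x^*)) / \varepsilon)$ iterations.
Each SVRG iteration costs about as much as an SGD iteration, roughly $n$ times less than a full gradient descent step. Since GD needs $O(1/\varepsilon)$ iterations at $n$ component gradients each, the $O(n^{2/3}/\varepsilon)$ iteration count gives a theoretical $\Omega(n^{1/3})$ speedup over GD.
The variance bound is established as $O(\|x_k^s - x_0^s\|^2)$, which lets the epoch/sub-epoch analysis guarantee progress.
With $m = n$ and $\eta = \Theta(1/(n^{2/3}L))$, the output $x$ satisfies $\mathbb{E}[\|\nabla f(x)\|^2] \le O\big(L(f(x^{\phi}) - \min f) / (S n^{1/3})\big)$, where $x^{\phi}$ is the starting point and $S$ is the number of epochs.
Experiments on ERM with non-convex losses and on neural networks show that SVRG can match or even exceed SGD in training speed, especially when $\varepsilon$ is small and the loss is non-convex.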
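The variance bound, the iteration count, and the speedup in the findings above are connected by short arithmetic. The sketch below assumes that each $f_i$ is $L$-smooth, that the index $i$ is drawn uniformly, that $x^{\phi}$ is the starting point and $S$ the number of epochs of $m = n$ inner iterations, and that GD is charged $n$ component gradients per iteration. Constants hidden in $O(\cdot)$ are ignored, and $\Delta$ is shorthand introduced here for $f(x^{\phi}) - \min f$.

```latex
% Variance of the estimator: use E||X - E[X]||^2 <= E||X||^2, then L-smoothness of each f_i.
\begin{align*}
\mathbb{E}_i \big\| \widetilde{\nabla}_k - \nabla f(x_k^s) \big\|^2
  &= \mathbb{E}_i \big\| \big(\nabla f_i(x_k^s) - \nabla f_i(x_0^s)\big)
       - \big(\nabla f(x_k^s) - \nabla f(x_0^s)\big) \big\|^2 \\
  &\le \mathbb{E}_i \big\| \nabla f_i(x_k^s) - \nabla f_i(x_0^s) \big\|^2
   \;\le\; L^2 \, \| x_k^s - x_0^s \|^2 .
\end{align*}

% From the stated guarantee to the iteration count.
\begin{align*}
O\!\Big( \frac{L \Delta}{S \, n^{1/3}} \Big) \le \varepsilon
  \quad &\text{holds once} \quad
  S = \Theta\!\Big( \frac{L \Delta}{\varepsilon \, n^{1/3}} \Big), \\
\text{total inner iterations} \quad
  T = S \cdot n &= O\!\Big( \frac{n^{2/3} L \Delta}{\varepsilon} \Big).
\end{align*}

% Comparison with full gradient descent, measured in component-gradient evaluations.
\[
\frac{\text{GD cost}}{\text{SVRG cost}}
  = \frac{n \cdot O(L \Delta / \varepsilon)}{O(n^{2/3} L \Delta / \varepsilon)}
  \approx n^{1/3}.
\]
```

The $S$ full-gradient snapshots add another $S \cdot n$ component gradients, which is the same order as $T$ and so does not change the comparison.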

This review was generated by AI and reviewed by a human editor.