[Paper Review] Less than a Single Pass: Stochastically Controlled Stochastic Gradient Method
This paper introduces Stochastically Controlled Stochastic Gradient (SCSG), a variance-reduced optimization method that can converge in less than a single full pass over the data on low-accuracy problems. SCSG draws the number of inner iterations from a geometric random variable and replaces the full gradient with a subsampled gradient, which brings both computation and communication costs below linear dependence on the dataset size n. It is most effective in low-accuracy regimes, where it outperforms SGD in both theory and practice.
We develop and analyze a procedure for gradient-based optimization that we refer to as stochastically controlled stochastic gradient (SCSG). As a member of the SVRG family of algorithms, SCSG makes use of gradient estimates at two scales, with the number of updates at the faster scale being governed by a geometric random variable. Unlike most existing algorithms in this family, both the computation cost and the communication cost of SCSG do not necessarily scale linearly with the sample size $n$; indeed, these costs are independent of $n$ when the target accuracy is low. An experimental evaluation on real datasets confirms the effectiveness of SCSG.
Motivation & Objective
- Address the inefficiency of existing SVRG-family methods that scale linearly with dataset size n in computation and communication.
- Develop a method that achieves convergence with less than a single pass through the data, particularly effective when target accuracy ε is low.
- Reduce dependence on n in computational and communication costs by introducing stochastic control over the number of iterations via a geometric random variable.
- Introduce a new problem difficulty measure H(f) that provides finite and small bounds for many practical problems where SGD lacks theoretical guarantees.
- Demonstrate that SCSG maintains favorable convergence rates comparable to SGD but with significantly improved constants, especially in low-accuracy regimes.
Proposed method
- Propose SCSG as a variant of SVRG that uses a subsampled full gradient estimate instead of the full dataset gradient.
- Control the number of inner iterations using a geometrically distributed random variable, so that the length of each inner loop is random rather than fixed in advance.
- Use a two-scale gradient estimation: stochastic gradients from mini-batches and a control variate from a subsampled full gradient.
- Introduce a new problem-specific measure H(f) that characterizes the intrinsic difficulty of finite-sum optimization problems.
- Design the algorithm so that both computation and communication costs are independent of n when the target accuracy ε is low.
- Theoretical analysis shows that the expected number of gradient evaluations scales as O((H(f)/(με) ∧ n + κ) log(Δf/ε)), with H(f) replacing the uniform gradient norm bound used in SGD.
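The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the least-squares objective, the synthetic data, the step size, the batch size, and the choice of a geometric inner-loop length with mean equal to the batch size are all assumptions made for the example, as is sampling the inner index from the whole dataset. The function names (`scsg`, `grad_i`, `loss`) are hypothetical.

```python
import math
import random


def grad_i(a, b, x):
    """Gradient of f_i(x) = 0.5 * (a.x - b)^2 for a single sample (a, b)."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj for aj in a]


def loss(A, y, x):
    """Average least-squares loss over the whole dataset."""
    total = 0.0
    for a, b in zip(A, y):
        total += 0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - b) ** 2
    return total / len(A)


def scsg(A, y, x0, batch_size, step, epochs, rng):
    """SCSG-style sketch; returns the final iterate and a gradient-evaluation count."""
    n, d = len(A), len(x0)
    x_ref = list(x0)
    grad_evals = 0
    p = batch_size / (batch_size + 1.0)  # assumed: inner length has mean batch_size
    for _ in range(epochs):
        # Slow scale: gradient estimate on a random batch instead of all n samples.
        batch = rng.sample(range(n), batch_size)
        g = [0.0] * d
        for i in batch:
            gi = grad_i(A[i], y[i], x_ref)
            g = [gj + gij / batch_size for gj, gij in zip(g, gi)]
        grad_evals += batch_size
        # Geometric inner-loop length with P(N >= k) = p^k, via inverse transform.
        u = 1.0 - rng.random()  # uniform on (0, 1]
        n_inner = int(math.log(u) / math.log(p))
        # Fast scale: variance-reduced stochastic steps using g as control variate.
        x = list(x_ref)
        for _ in range(n_inner):
            i = rng.randrange(n)  # assumed: inner index drawn from the full dataset
            gx = grad_i(A[i], y[i], x)
            gr = grad_i(A[i], y[i], x_ref)
            x = [xj - step * (u1 - u2 + u3)
                 for xj, u1, u2, u3 in zip(x, gx, gr, g)]
        grad_evals += 2 * n_inner
        x_ref = x
    return x_ref, grad_evals


# Synthetic demo problem (illustrative only).
rng = random.Random(0)
x_true = [1.0, -2.0, 0.5]
A = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
y = [sum(a * x for a, x in zip(row, x_true)) + 0.1 * rng.gauss(0, 1) for row in A]
x0 = [0.0, 0.0, 0.0]

initial = loss(A, y, x0)
x_hat, evals = scsg(A, y, x0, batch_size=20, step=0.02, epochs=20, rng=rng)
final = loss(A, y, x_hat)
print("loss before:", initial, "loss after:", final)
print("gradient evaluations:", evals, "dataset size:", len(A))
```

Comparing the printed gradient-evaluation count with the dataset size gives a rough feel for the "less than a single pass" regime; the actual count depends on the random inner-loop lengths.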
Experimental results
Research questions
- RQ1: Can a variance-reduced stochastic optimization method achieve convergence in less than a single pass through the data for low-accuracy problems?
- RQ2: How can communication and computation costs be reduced below linear dependence on n in finite-sum optimization?
- RQ3: What new problem measure can replace the uniform gradient norm bound in SGD to provide finite and tighter convergence guarantees?
- RQ4: Can a stochastic control mechanism over iteration count lead to improved theoretical and practical performance in optimization?
- RQ5: How does the new difficulty measure H(f) compare to existing measures in capturing the intrinsic complexity of finite-sum problems?
Key findings
- SCSG achieves convergence in less than a single full data pass when the target accuracy ε is low, making it highly efficient for large-scale problems.
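- The expected computation cost of SCSG is O((H(f)/(με) ∧ n + κ) log(Δf/ε)). When ε is large enough that H(f)/(με) < n, this bound does not depend on n, unlike standard SVRG, which computes a full gradient over all n samples in every epoch.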
- The algorithm's convergence rate depends on H(f), a new finite measure that is O(1) in many practical problems (e.g., least squares, logistic regression), unlike the potentially infinite uniform gradient norm bound in SGD.
- For multi-class logistic regression, the paper proves that H(f) ≤ (2/n)∑‖ai‖², showing it remains bounded and small under standard assumptions.
- Empirical results on real datasets confirm that SCSG outperforms SGD and other SVRG variants in terms of convergence speed and communication efficiency.
- Theoretical analysis shows that SCSG never performs worse than SGD in low-accuracy regimes and can achieve significantly better constants due to the H(f) measure.
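To make the cost bound and the logistic-regression bound on H(f) concrete, the following sketch evaluates both numerically. It assumes the bound is read as (min(H(f)/(με), n) + κ) · log(Δf/ε) with constants dropped, that is, with ∧ binding tighter than +. The function names and all numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def scsg_cost_bound(H, mu, eps, n, kappa, delta_f):
    """Evaluate (min(H/(mu*eps), n) + kappa) * log(delta_f/eps), constants dropped."""
    return (min(H / (mu * eps), n) + kappa) * math.log(delta_f / eps)


def logistic_H_bound(A):
    """Upper bound (2/n) * sum_i ||a_i||^2 quoted above for multi-class logistic regression."""
    return 2.0 * sum(sum(v * v for v in a) for a in A) / len(A)


# Low accuracy: H/(mu*eps) is far below n, so the bound is the same for both n.
low_small_n = scsg_cost_bound(H=1.0, mu=0.1, eps=0.1, n=10**6, kappa=10.0, delta_f=1.0)
low_large_n = scsg_cost_bound(H=1.0, mu=0.1, eps=0.1, n=10**9, kappa=10.0, delta_f=1.0)

# High accuracy: H/(mu*eps) exceeds n, so the min is capped at n.
high = scsg_cost_bound(H=1.0, mu=0.1, eps=1e-9, n=10**6, kappa=10.0, delta_f=1.0)

# Two feature vectors with squared norms 1 and 4.
h_bound = logistic_H_bound([[1.0, 0.0], [0.0, 2.0]])

print(low_small_n, low_large_n, high, h_bound)
```

In the low-accuracy setting the two evaluations coincide even though n differs by a factor of 1000, which is the sense in which the cost is independent of n.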
This review was created by AI and reviewed by human editors.