QUICK REVIEW

[Paper Review] Stochastic Optimization with Importance Sampling

Peilin Zhao, Tong Zhang | arXiv (Cornell University) | Jan 13, 2014

Stochastic Gradient Optimization Techniques | 25 references | 66 citations

TL;DR

This paper proposes importance sampling strategies for proximal stochastic gradient descent (prox-SGD) and proximal stochastic dual coordinate ascent (prox-SDCA) to reduce stochastic gradient variance and accelerate convergence. By sampling data points according to gradient norms (for prox-SGD) or smoothness parameters (for prox-SDCA), the method can converge significantly faster than uniform sampling under suitable conditions. The claims are supported by theoretical guarantees and by experiments on multiple datasets.

ABSTRACT

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling can guarantee that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.

Motivation & Objective

To address the high variance in stochastic gradient estimators caused by uniform sampling in stochastic optimization.
To improve convergence rates of prox-SGD and prox-SDCA by minimizing variance through non-uniform sampling.
To derive optimal sampling distributions based on gradient norms and smoothness parameters for both algorithms.
To provide theoretical convergence rate improvements under suitable conditions, generalizing existing results.
To validate the proposed methods empirically on real-world datasets, confirming faster duality gap reduction and stable performance.

Proposed method

For prox-SGD, the method samples each data point with probability proportional to the norm of that point's gradient, which minimizes the variance of the gradient estimator.
An unbiased, importance-weighted gradient estimator is constructed using these non-uniform sampling probabilities to maintain convergence guarantees.
For prox-SDCA, the sampling distribution is derived to maximize the expected increase in the dual objective per iteration, depending on smoothness constants of the loss functions.
Theoretical analysis shows that optimal sampling distributions depend on gradient norms (for prox-SGD) and loss function smoothness (for prox-SDCA).
Upper bounds on gradient norms are used to simplify computation while preserving variance reduction benefits.
The framework generalizes to proximal stochastic mirror descent and includes standard uniform sampling as a special case.
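The prox-SGD items above can be illustrated with a minimal sketch. It is not the authors' implementation: the logistic loss, the synthetic data, and every function name (`logistic_grads`, `sampling_probs`, `second_moment`, `sample_gradient`) are assumptions made for illustration. It builds the importance-weighted estimator `grads[i] / (n * p[i])`, uses `||x_i||` as an upper bound on the logistic gradient norm, and compares the second moment of the estimator under three sampling distributions.

```python
import numpy as np

def logistic_grads(w, X, y):
    """Per-example gradients of log(1 + exp(-y * <x, w>)), one row per example."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))
    return coeff[:, None] * X

def sampling_probs(X):
    """p_i proportional to ||x_i||, which upper-bounds the logistic gradient norm."""
    norms = np.linalg.norm(X, axis=1)
    return norms / norms.sum()

def second_moment(grads, p):
    """E||g||^2 for the estimator g = grads[i] / (n * p[i]) with i drawn from p."""
    n = grads.shape[0]
    sq_norms = np.sum(grads ** 2, axis=1)
    return np.sum(sq_norms / p) / n ** 2

def sample_gradient(grads, p, rng):
    """Draw one importance-weighted stochastic gradient."""
    n = grads.shape[0]
    i = rng.choice(n, p=p)
    return grads[i] / (n * p[i])

# Synthetic data (assumption): rows of X have very different norms.
rng = np.random.default_rng(0)
n, d = 200, 5
scales = rng.lognormal(mean=0.0, sigma=1.0, size=n)
X = rng.normal(size=(n, d)) * scales[:, None]
y = rng.choice([-1.0, 1.0], size=n)
w = 0.01 * rng.normal(size=d)

grads = logistic_grads(w, X, y)
full_grad = grads.mean(axis=0)

uniform = np.full(n, 1.0 / n)
p_bound = sampling_probs(X)                 # cheap: depends only on the data
gnorms = np.linalg.norm(grads, axis=1)
p_opt = gnorms / gnorms.sum()               # exact gradient norms: needs all gradients

# Exact expectation of the estimator under p_bound equals the full gradient.
expected = np.sum(p_bound[:, None] * grads / (n * p_bound[:, None]), axis=0)

m_uniform = second_moment(grads, uniform)
m_bound = second_moment(grads, p_bound)
m_opt = second_moment(grads, p_opt)
print("second moments (uniform, bound, optimal):", m_uniform, m_bound, m_opt)

g = sample_gradient(grads, p_bound, rng)    # one draw, usable in an SGD step
```

Only the exact-norm distribution `p_opt` is guaranteed to give the smallest second moment at a given `w`. The bound-based distribution `p_bound` is cheaper because it never changes during training, but this sketch makes no claim that it beats uniform sampling for every `w`.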

Experimental results

Research questions

RQ1: Can importance sampling reduce the variance of stochastic gradients in prox-SGD beyond uniform sampling?
RQ2: What is the optimal sampling distribution for prox-SGD that minimizes gradient variance?
RQ3: How can importance sampling be adapted for prox-SDCA to maximize dual objective improvement per iteration?
RQ4: What theoretical convergence rate improvements are achievable with importance sampling compared to uniform sampling?
RQ5: Does the proposed method maintain or improve test accuracy while accelerating convergence?

Key findings

The proposed importance sampling strategy for prox-SGD achieves a lower variance gradient estimator by sampling data points with probability proportional to the norm of their gradients.
For prox-SDCA, the optimal sampling distribution depends on the smoothness constants of the loss functions, leading to faster dual objective improvement.
Theoretical analysis shows that the convergence rate is significantly improved under suitable conditions, with the new method generalizing existing uniform sampling results.
Empirical results on datasets such as ijcnn1, kdd2010, and w8a show that Iprox-SDCA reduces the duality gap faster than standard SDCA.
Test error rates of Iprox-SDCA are comparable to standard SDCA, indicating no degradation in generalization despite faster convergence.
Variance of stochastic gradients is slightly reduced in Iprox-SDCA, but the improvement is small due to inherent variance reduction in SDCA.
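The first finding above, that sampling proportionally to gradient norms minimizes the variance, can be checked with a short derivation. This is a sketch under notation the review does not define, so the symbols are assumptions: \phi_i is the loss on example i, n is the number of examples, p_i is the probability of drawing example i, and w is the current iterate.

```latex
% Importance-weighted estimator, unbiased for any p with all p_i > 0:
\[
g = \frac{\nabla \phi_i(w)}{n\,p_i}, \quad i \sim p,
\qquad
\mathbb{E}[g] = \sum_{i=1}^{n} p_i \,\frac{\nabla \phi_i(w)}{n\,p_i}
             = \frac{1}{n}\sum_{i=1}^{n} \nabla \phi_i(w).
\]
% Second moment, bounded below by Cauchy--Schwarz using \sum_i p_i = 1:
\[
\mathbb{E}\,\|g\|^2
  = \frac{1}{n^2}\sum_{i=1}^{n} \frac{\|\nabla \phi_i(w)\|^2}{p_i}
  \;\ge\; \Big(\frac{1}{n}\sum_{i=1}^{n} \|\nabla \phi_i(w)\|\Big)^{2},
\]
% with equality exactly when
\[
p_i = \frac{\|\nabla \phi_i(w)\|}{\sum_{j=1}^{n} \|\nabla \phi_j(w)\|}.
\]
% Uniform sampling, p_i = 1/n, gives instead
\[
\mathbb{E}\,\|g\|^2 = \frac{1}{n}\sum_{i=1}^{n} \|\nabla \phi_i(w)\|^2
  \;\ge\; \Big(\frac{1}{n}\sum_{i=1}^{n} \|\nabla \phi_i(w)\|\Big)^{2}.
\]
```

Because the mean of g is the same for every valid p, minimizing the second moment also minimizes the variance. The gap between the two bounds is the empirical variance of the gradient norms, so the gain over uniform sampling is largest when those norms differ widely across examples and vanishes when they are all equal.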

This review was created by AI and reviewed by human editors.