QUICK REVIEW

[Paper Review] Stochastic Optimization with Importance Sampling

Peilin Zhao, Tong Zhang|arXiv (Cornell University)|Jan 13, 2014

Stochastic Gradient Optimization Techniques | References: 25 | Cited by: 66

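One-Sentence Summary

This paper proposes importance sampling strategies for proximal stochastic gradient descent (prox-SGD) and proximal stochastic dual coordinate ascent (prox-SDCA) that reduce the variance of the stochastic gradient and accelerate convergence; by sampling data points according to gradient norms or smoothness parameters, the method converges significantly faster than uniform sampling, with theoretical guarantees and empirical validation on several datasets.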

ABSTRACT

Uniform sampling of training data has been commonly used in traditional stochastic optimization algorithms such as Proximal Stochastic Gradient Descent (prox-SGD) and Proximal Stochastic Dual Coordinate Ascent (prox-SDCA). Although uniform sampling can guarantee that the sampled stochastic quantity is an unbiased estimate of the corresponding true quantity, the resulting estimator may have a rather high variance, which negatively affects the convergence of the underlying optimization procedure. In this paper we study stochastic optimization with importance sampling, which improves the convergence rate by reducing the stochastic variance. Specifically, we study prox-SGD (actually, stochastic mirror descent) with importance sampling and prox-SDCA with importance sampling. For prox-SGD, instead of adopting uniform sampling throughout the training process, the proposed algorithm employs importance sampling to minimize the variance of the stochastic gradient. For prox-SDCA, the proposed importance sampling scheme aims to achieve higher expected dual value at each dual coordinate ascent step. We provide extensive theoretical analysis to show that the convergence rates with the proposed importance sampling methods can be significantly improved under suitable conditions both for prox-SGD and for prox-SDCA. Experiments are provided to verify the theoretical analysis.

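Motivation and Objectives

Address the high variance of stochastic gradient estimators caused by uniform sampling in stochastic optimization.
Improve the convergence rates of prox-SGD and prox-SDCA through variance-minimizing non-uniform sampling.
Derive the optimal sampling distributions for both algorithms based on gradient norms and smoothness parameters.
Provide theoretical convergence-rate improvements under suitable conditions, generalizing existing results.
Validate the proposed methods empirically on real-world datasets, confirming faster duality-gap reduction and stable performance.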

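Proposed Method

For prox-SGD, the method uses importance sampling with probabilities proportional to the norms of the stochastic gradients, which minimizes the variance of the gradient estimator.
These non-uniform probabilities are used to build an unbiased, importance-weighted gradient estimator, so the convergence guarantees are preserved.
For prox-SDCA, the sampling distribution is derived to maximize the expected increase of the dual objective at each iteration; it depends on the smoothness constants of the loss functions.
The theoretical analysis shows that the optimal sampling distribution depends on gradient norms (for prox-SGD) and on loss smoothness (for prox-SDCA).
Upper bounds on the gradient norms are used to simplify computation while retaining the benefit of variance reduction.
The framework extends to proximal stochastic mirror descent and includes standard uniform sampling as a special case.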
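
The importance-weighted gradient estimator described above can be illustrated with a small numerical sketch. It is not the authors' implementation: the function names (`sampling_probs`, `weighted_grad`, `estimator_variance`) and the synthetic per-example gradients are assumptions made here for illustration. The sketch samples index i with probability p_i proportional to the gradient norm, rescales the sampled gradient by 1/(n p_i) so the estimator stays unbiased, and compares its exact variance with that of uniform sampling.

```python
import numpy as np

def sampling_probs(norm_bounds):
    """Probabilities proportional to per-example gradient norms (or upper bounds on them).

    Assumes every entry is strictly positive; a zero entry would give p_i = 0
    and a division by zero in the reweighting below.
    """
    b = np.asarray(norm_bounds, dtype=float)
    return b / b.sum()

def weighted_grad(grads, p, i):
    """Importance-weighted estimate of the mean gradient: g_i / (n * p_i)."""
    n = grads.shape[0]
    return grads[i] / (n * p[i])

def estimator_variance(grads, p):
    """Exact variance E||g_hat - mean_grad||^2 when i is drawn from p."""
    n = grads.shape[0]
    mean_grad = grads.mean(axis=0)
    # E||g_hat||^2 = sum_i ||g_i||^2 / (n^2 * p_i)
    second_moment = np.sum(np.sum(grads ** 2, axis=1) / (n ** 2 * p))
    return second_moment - mean_grad @ mean_grad

# Synthetic per-example gradients with very different norms (hypothetical data).
rng = np.random.default_rng(0)
n, d = 200, 5
scales = rng.lognormal(mean=0.0, sigma=1.5, size=n)
grads = rng.normal(size=(n, d)) * scales[:, None]

p_uniform = np.full(n, 1.0 / n)
p_importance = sampling_probs(np.linalg.norm(grads, axis=1))

# One stochastic draw, as it would be used inside a prox-SGD step.
i = rng.choice(n, p=p_importance)
g_hat = weighted_grad(grads, p_importance, i)

var_uniform = estimator_variance(grads, p_uniform)
var_importance = estimator_variance(grads, p_importance)
print(var_uniform, var_importance)
```

In practice the exact per-example gradient norms are not available without a full pass over the data, which is why the method falls back on upper bounds; `sampling_probs` accepts such bounds in place of the exact norms.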

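Experimental Results

Research Questions

RQ1: Can importance sampling reduce the variance of the stochastic gradient in prox-SGD below that of uniform sampling?
RQ2: What sampling distribution for prox-SGD minimizes the gradient variance?
RQ3: How can importance sampling be adapted to prox-SDCA so that the per-iteration improvement of the dual objective is maximized?
RQ4: How much theoretical improvement in convergence rate can importance sampling achieve over uniform sampling?
RQ5: Does the proposed method maintain or improve test accuracy while accelerating convergence?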
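
For RQ1 and RQ2, the standard variance argument can be written out in a few lines. The notation below (f_i, w, p_i, n) is assumed here for illustration and does not appear in this review; the derivation is a sketch of the textbook Cauchy-Schwarz argument, not a reproduction of the paper's proof.

```latex
% Assumed notation: f_i is the loss on example i, w the current iterate,
% p = (p_1, ..., p_n) a sampling distribution with all p_i > 0.
\begin{align*}
\hat g &= \frac{1}{n p_i}\,\nabla f_i(w), \qquad i \sim p, \\
\mathbb{E}[\hat g] &= \sum_{i=1}^{n} p_i \cdot \frac{\nabla f_i(w)}{n p_i}
  = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(w)
  \quad \text{(unbiased for any such } p\text{)}, \\
\mathbb{E}\,\|\hat g\|^2 &= \frac{1}{n^2}\sum_{i=1}^{n} \frac{\|\nabla f_i(w)\|^2}{p_i}
  \;\ge\; \frac{1}{n^2}\Big(\sum_{i=1}^{n} \|\nabla f_i(w)\|\Big)^{2}
  \quad \text{(Cauchy-Schwarz, using } \textstyle\sum_i p_i = 1\text{)}, \\
p_i^{*} &= \frac{\|\nabla f_i(w)\|}{\sum_{j=1}^{n} \|\nabla f_j(w)\|}
  \quad \text{(attains the lower bound)}.
\end{align*}
% Uniform sampling, p_i = 1/n, gives (1/n) \sum_i \|\nabla f_i(w)\|^2, which is
% never smaller than the bound above. Since the mean of \hat g does not depend
% on p, minimizing the second moment also minimizes the variance.
```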

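Key Findings

The proposed importance sampling strategy for prox-SGD yields a lower-variance gradient estimator by sampling data points in proportion to their gradient norms.
For prox-SDCA, the optimal sampling distribution depends on the smoothness constants of the loss functions, which yields faster improvement of the dual objective.
The theoretical analysis shows significantly improved convergence rates under suitable conditions, and the new results generalize existing ones for uniform sampling.
Empirical results on datasets including ijcnn1, kdd2010, and w8a show that Iprox-SDCA converges faster than standard SDCA in terms of duality gap.
The test error of Iprox-SDCA is comparable to that of standard SDCA, indicating that faster convergence does not come at the cost of generalization.
The variance of the stochastic gradient in Iprox-SDCA is only slightly reduced; the improvement is small because SDCA already has a built-in variance-reduction mechanism.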
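
The idea of sampling dual coordinates according to smoothness constants can be tried out on a toy problem. The sketch below is a simplification and not the paper's Iprox-SDCA. It assumes ridge regression with squared loss, uses the closed-form exact coordinate maximization for that loss instead of the paper's update rule, and assumes a sampling distribution of the form p_i proportional to 1 + L_i/(λn) with L_i = ||x_i||², which this review does not state. The names `sdca_ridge` and `duality_gap` and the synthetic data are hypothetical.

```python
import numpy as np

def sdca_ridge(X, y, lam, p, epochs, seed=0):
    """SDCA sketch for (1/n) * sum_i 0.5*(x_i.w - y_i)^2 + (lam/2)*||w||^2.

    Dual coordinates are drawn from the distribution p (uniform or not).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)   # dual variables
    w = np.zeros(d)       # primal iterate, kept equal to X.T @ alpha / (lam * n)
    sq_norms = np.sum(X ** 2, axis=1)
    for i in rng.choice(n, size=epochs * n, p=p):
        # Exact maximization of the dual objective over coordinate i (squared loss).
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)
    return w, alpha

def duality_gap(X, y, lam, w, alpha):
    """Primal minus dual objective; assumes w = X.T @ alpha / (lam * n)."""
    primal = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    dual = np.mean(alpha * y - 0.5 * alpha ** 2) - 0.5 * lam * (w @ w)
    return primal - dual

# Synthetic data whose rows have different norms (hypothetical).
rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1.0
X = rng.normal(size=(n, d)) * rng.uniform(0.5, 3.0, size=n)[:, None]
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

L = np.sum(X ** 2, axis=1)  # smoothness constant of each loss as a function of w
p_uniform = np.full(n, 1.0 / n)
p_smooth = (1.0 + L / (lam * n)) / np.sum(1.0 + L / (lam * n))  # assumed form

initial_gap = duality_gap(X, y, lam, np.zeros(d), np.zeros(n))
gaps = {}
for name, p in [("uniform", p_uniform), ("smoothness", p_smooth)]:
    w, alpha = sdca_ridge(X, y, lam, p, epochs=50)
    gaps[name] = duality_gap(X, y, lam, w, alpha)
print(initial_gap, gaps)
```

The duality gap is the quantity the review reports for the experiments, so it is used here as the progress measure. Whether the non-uniform distribution helps on a given problem depends on how much the L_i differ across examples.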


This review was generated by AI and reviewed by human editors.