QUICK REVIEW

[Paper Review] Decentralized Stochastic Gradient Tracking for Non-convex Empirical Risk Minimization

Jiaqi Zhang, Keyou You|arXiv (Cornell University)|Sep 6, 2019

Stochastic Gradient Optimization Techniques|References: 60|Cited by: 25

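One-Sentence Summary

This paper proposes a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization over a peer-to-peer network, in which each node uses a mini-batch stochastic gradient whose batch size is proportional to the size of its local dataset. It establishes a non-asymptotic convergence rate of order $ O\big(1/\sum_k \theta_k\big) $, where $ \theta_k $ is the stepsize at iteration $ k $, shows that the rate is network independent under certain conditions, and shows that a linear speedup is achievable in some scenarios, with performance comparable to centralized SGD.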

ABSTRACT

This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization problems over a peer-to-peer network of nodes, which is in sharp contrast to the existing DSGT only for convex problems. To ensure exact convergence and handle the variance among decentralized datasets, each node performs a stochastic gradient (SG) tracking step by using a mini-batch of samples, where the batch size is designed to be proportional to the size of the local dataset. We explicitly evaluate the convergence rate of DSGT with respect to the number of iterations in terms of algebraic connectivity of the network, mini-batch size, gradient variance, etc. Under certain conditions, we further show that DSGT has a network independence property in the sense that the network topology only affects the convergence rate up to a constant factor. Hence, the convergence rate of DSGT can be comparable to the centralized SGD method. Moreover, a linear speedup of DSGT with respect to the number of nodes is achievable for some scenarios. Numerical experiments for neural networks and logistic regression problems on CIFAR-10 finally illustrate the advantages of DSGT.

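Motivation and Objectives

Address the lack of convergence guarantees for decentralized stochastic gradient methods in the non-convex setting, particularly when datasets are heterogeneous and spread across nodes.
Design a decentralized algorithm that converges exactly to a stationary point even when local data distributions and dataset sizes vary across nodes.
Analyze the convergence rate of the proposed DSGT algorithm in terms of the algebraic connectivity of the network, the mini-batch size, the gradient variance, and the stepsize rule.
Investigate whether the network topology affects convergence only up to a constant factor, with the goal of establishing network independence.
Show that, under certain conditions, DSGT achieves a linear speedup in the number of nodes, improving the scalability of decentralized learning.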

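Proposed Method

Proposes a decentralized stochastic gradient tracking (DSGT) algorithm in which each node maintains a local estimate of the global gradient, computed from mini-batches of its local data.
Sets the mini-batch size proportional to the size of the local dataset, to handle the variance among decentralized datasets and ensure exact convergence.
Introduces a gradient tracking step through which nodes track the network-wide average gradient by communicating with their neighbors.
Uses a mixing matrix $ W $ in consensus-based updates to aggregate local gradients and states, driving the nodes toward agreement.
Considers both constant and diminishing stepsize rules to analyze convergence under different conditions.
Uses the algebraic connectivity of the communication graph, expressed through $ 1 - \rho $, as a key parameter in the convergence rate bounds.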
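
The steps listed above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a commonly used gradient-tracking form (mix, step along the tracker, then correct the tracker with the change in local stochastic gradients), a ring network with weight 1/3 on each node and its two neighbours, and a scalar least-squares loss $ 0.5 (x - a)^2 $ chosen only for brevity (it is convex, unlike the paper's setting). The names `ring_weights`, `mix`, `dsgt`, and `batch_frac` are hypothetical.

```python
import random


def ring_weights(n):
    """Doubly stochastic mixing matrix W for a ring (assumed topology)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def mix(W, v):
    """One consensus step: node i takes a weighted average of neighbour values."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]


def dsgt(datasets, steps=300, stepsize=0.1, batch_frac=0.5, seed=0):
    """Toy gradient tracking with batch sizes proportional to local data sizes."""
    rng = random.Random(seed)
    n = len(datasets)
    batch = [max(1, round(batch_frac * len(d))) for d in datasets]
    scale = sum(batch) / n  # shared normalizer, so larger datasets weigh more

    def local_sg(i, x_i):
        # Summed mini-batch gradient of 0.5 * (x - a)^2, sampled without replacement.
        samples = rng.sample(datasets[i], batch[i])
        return sum(x_i - a for a in samples) / scale

    W = ring_weights(n)
    x = [0.0] * n
    g = [local_sg(i, x[i]) for i in range(n)]
    y = list(g)  # trackers start at the first local stochastic gradients
    for _ in range(steps):
        # State update: step along the tracked gradient, then mix with neighbours.
        x = mix(W, [x[i] - stepsize * y[i] for i in range(n)])
        # Tracker update: mix, then add the change in the local stochastic gradient.
        g_new = [local_sg(i, x[i]) for i in range(n)]
        wy = mix(W, y)
        y = [wy[i] + g_new[i] - g[i] for i in range(n)]
        g = g_new
    return x


# Heterogeneous toy data: node i holds 4 * (i + 1) samples centred at 2 * i.
data_rng = random.Random(1)
datasets = [[data_rng.gauss(2.0 * i, 1.0) for _ in range(4 * (i + 1))]
            for i in range(4)]
global_mean = sum(sum(d) for d in datasets) / sum(len(d) for d in datasets)
print("global minimizer:", global_mean)
print("node states:     ", dsgt(datasets))
```

In this sketch the minimizer of the pooled loss is the mean of all samples, so the node states can be compared with `global_mean` directly. Dividing every summed mini-batch gradient by the same `scale` is what lets larger local datasets contribute proportionally more to the tracked average. With `batch_frac=1.0` the gradients are exact and the nodes agree on the minimizer; with smaller batches and a constant stepsize they hover near it.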

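Experimental Results

Research Questions

RQ1: Can a decentralized stochastic gradient method achieve exact convergence for non-convex empirical risk minimization with heterogeneous local datasets?
RQ2: How does the convergence rate of DSGT depend on the algebraic connectivity of the network, the mini-batch size, and the gradient variance?
RQ3: Under what conditions does the network topology affect the convergence rate only up to a constant factor, yielding network independence?
RQ4: Can DSGT achieve a linear speedup in the number of nodes in decentralized training?
RQ5: When the objective function is convex, does DSGT converge to an optimal solution, and how does its convergence rate compare with that of centralized SGD?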

Main Findings

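DSGT achieves a non-asymptotic convergence rate of $ O\Big( \frac{1}{\sum_k \theta_k} \Big( D + \frac{\rho^2}{(1-\rho)^3} \sum_k \theta_k^3 \Big) \Big) $, with variance-dependent constants omitted, where $ \theta_k $ is the stepsize at iteration $ k $, $ D $ depends on the initial error, $ \rho $ is the network connectivity parameter (so that $ 1 - \rho $ reflects the algebraic connectivity), and the $ \frac{\rho^2}{(1-\rho)^3} \sum_k \theta_k^3 $ term captures the combined effect of the network and the gradient variance.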
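With a constant stepsize $ \theta $ over $ K $ iterations, so that $ \sum_k \theta_k = K\theta $ and $ \sum_k \theta_k^3 = K\theta^3 $, the bound becomes $ O\Big( \frac{D}{K\theta} + \frac{\rho^2 \theta^2}{(1-\rho)^3} \Big) $: the first term decays as $ 1/K $, while the second is controlled by the choice of $ \theta $.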
For diminishing stepsizes $ \theta_k = O(1/k^p) $ with $ p \neq 0.5 $, the rate is $ O(1/k^{1-p}) $; for $ p = 0.5 $, the rate is $ O(\ln(k)/\sqrt{k}) $. Both cases are sublinear.
When $ \frac{\rho^2}{(1-\rho)^3} \sum_k \theta_k^3 = O\big( \sum_k \theta_k^2 \big) $, the algorithm is network independent: the network topology affects the convergence rate only up to a constant factor.
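Under this condition, the convergence rate of DSGT is comparable to that of centralized SGD, so decentralized training can match the centralized method in convergence speed.
Numerical experiments on CIFAR-10, training neural networks and logistic regression models, show competitive performance for DSGT, illustrating the theoretical findings and the potential for speedup.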
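
The network-independence condition in the findings above can be checked numerically for a given topology and stepsize rule. The sketch below rests on stated assumptions: $ \rho $ is taken to be the second-largest eigenvalue magnitude of a symmetric doubly stochastic mixing matrix (a common convention, which this summary does not spell out), the network is a ring with weight 1/3 on each node and its two neighbours (whose eigenvalues are $ 1/3 + (2/3)\cos(2\pi k/n) $ because the matrix is circulant), and the stepsize values are arbitrary. The helper names `ring_rho` and `network_ratio` are hypothetical.

```python
import math


def ring_rho(n):
    """Second-largest eigenvalue magnitude of the assumed ring matrix (n >= 3)."""
    return max(abs(1 / 3 + (2 / 3) * math.cos(2 * math.pi * k / n))
               for k in range(1, n))


def network_ratio(rho, stepsizes):
    """(rho^2 / (1 - rho)^3) * sum(theta^3) divided by sum(theta^2)."""
    cubic = sum(t ** 3 for t in stepsizes)
    quad = sum(t ** 2 for t in stepsizes)
    return rho ** 2 / (1 - rho) ** 3 * cubic / quad


K = 1000
for n in (4, 8, 16, 32):
    rho = ring_rho(n)
    const = network_ratio(rho, [0.1] * K)
    decay = network_ratio(rho, [1 / math.sqrt(k) for k in range(1, K + 1)])
    print(f"n={n:2d}  rho={rho:.4f}  constant-step ratio={const:.3f}  "
          f"1/sqrt(k) ratio={decay:.3f}")
```

A ratio that stays bounded by a modest constant means the network term is dominated by the sum of squared stepsizes, which is the regime the condition describes. For a constant stepsize $ \theta $ the ratio reduces to $ \theta \rho^2 / (1-\rho)^3 $, so a larger or more sparsely connected ring (with $ \rho $ closer to 1) calls for a smaller stepsize to stay in that regime.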


This review was generated by AI and reviewed by human editors.