QUICK REVIEW

[Paper Review] Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

Yudong Chen, Lili Su|arXiv (Cornell University)|May 16, 2017

Stochastic Gradient Optimization Techniques | References: 24 | Cited by: 142

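One-Sentence Summary

This paper proposes Byzantine Gradient Descent, a robust distributed learning algorithm that tolerates up to q Byzantine workers provided 2(1+ε)q ≤ m, and that converges exponentially fast, reaching an estimation error on the order of max{√(dq/N), √(d/N)} within O(log N) rounds.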

ABSTRACT

We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning -- a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is of $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is of $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge that arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function.
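The abstract's remark that simple averaging is vulnerable to a single Byzantine failure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the worker count, dimension, gradient values, and the names `honest`, `byzantine`, and `target` are assumptions, not taken from the paper. It shows that one worker who knows the other reports can steer the server's average to any vector it chooses.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 3  # hypothetical: 10 workers, 3-dimensional gradients

# m - 1 honest workers report noisy copies of the same true gradient.
true_grad = np.array([1.0, -2.0, 0.5])
honest = true_grad + 0.1 * rng.standard_normal((m - 1, d))

# The single Byzantine worker picks the average it wants the server to see
# and solves (sum(honest) + byzantine) / m = target for its own report.
target = np.array([100.0, 100.0, 100.0])
byzantine = m * target - honest.sum(axis=0)

reports = np.vstack([honest, byzantine])
average = reports.mean(axis=0)

print("true gradient :", true_grad)
print("server average:", average)  # equals `target` up to rounding
```

Because the average is linear in every report, no bound on the number of honest workers limits the damage; this is the weakness that a robust aggregation rule has to remove.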

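Motivation and Objectives

Motivate distributed statistical learning in the presence of adversarial (Byzantine) failures, in settings such as Federated Learning.
Develop a robust gradient aggregation method that tolerates Byzantine failures.
Provide convergence guarantees and characterize the estimation error under Byzantine failures.
Analyze the computation and communication costs of the proposed method.
Illustrate the method with an application to linear regression.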

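Proposed Method

Propose Byzantine Gradient Descent, in which the server aggregates gradients with a robust scheme based on batch means and the geometric median.
Partition the m working machines into k batches and compute the mean of the reported gradients within each batch.
Compute the geometric median of these k batch means to form the aggregated gradient used in the update.
Under strong convexity and Lipschitz-gradient assumptions, take gradient descent steps with step size η = L/(2M^2), where L is the strong-convexity constant and M is the Lipschitz constant of the gradient.
State a formal convergence theorem showing exponential convergence, so that after O(log N) rounds the error is bounded on the order of max{√(dq/N), √(d/N)}.
Cost analysis: O((Nd/m) log N) computation at each working machine, O(md + kd log^3 N) at the parameter server (k is the number of batches), and O(md log N) total communication.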
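The steps above can be sketched in Python on a synthetic linear regression problem. This is a minimal illustration under stated assumptions, not the authors' implementation. The geometric median is approximated with plain Weiszfeld iterations, a standard technique swapped in here for illustration. The function names, the synthetic data, the fixed set of Byzantine workers (the paper allows it to vary over time), and the attack (large random reports) are all hypothetical choices. With standard-normal features the population risk has an identity Hessian, so L = M = 1 and the step size L/(2M^2) is 0.5.

```python
import numpy as np

def geometric_median(points, iters=200, tol=1e-9):
    """Approximate geometric median of the rows of `points` (Weiszfeld iterations)."""
    z = points.mean(axis=0)
    for _ in range(iters):
        # Clamp distances so a point that coincides with z does not divide by zero.
        dist = np.maximum(np.linalg.norm(points - z, axis=1), 1e-12)
        w = 1.0 / dist
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

def aggregate(gradients, k):
    """Geometric median of the k batch means; k = 1 reduces to plain averaging."""
    batches = np.array_split(gradients, k)  # consecutive workers form a batch
    batch_means = np.stack([b.mean(axis=0) for b in batches])
    return geometric_median(batch_means)

def byzantine_gd(X_parts, y_parts, byz_ids, k, eta, rounds, rng):
    """Gradient descent in which the server aggregates reports with `aggregate`."""
    d = X_parts[0].shape[1]
    theta = np.zeros(d)
    for _ in range(rounds):
        grads = []
        for j, (X, y) in enumerate(zip(X_parts, y_parts)):
            if j in byz_ids:
                # Hypothetical attack: a large random vector instead of a gradient.
                grads.append(1e3 * rng.standard_normal(d))
            else:
                # Local gradient of the loss 0.5 * mean((X @ theta - y) ** 2).
                grads.append(X.T @ (X @ theta - y) / len(y))
        theta = theta - eta * aggregate(np.stack(grads), k)
    return theta

rng = np.random.default_rng(0)
m, n_per, d, k = 20, 500, 5, 10      # 20 workers, 500 samples each, 10 batches
byz_ids = {0, 7, 13}                 # q = 3 compromised workers (fixed here)
theta_star = rng.standard_normal(d)
X_parts = [rng.standard_normal((n_per, d)) for _ in range(m)]
y_parts = [X @ theta_star + 0.1 * rng.standard_normal(n_per) for X in X_parts]

theta_robust = byzantine_gd(X_parts, y_parts, byz_ids, k=k, eta=0.5, rounds=50, rng=rng)
theta_mean = byzantine_gd(X_parts, y_parts, byz_ids, k=1, eta=0.5, rounds=50, rng=rng)
err_robust = np.linalg.norm(theta_robust - theta_star)
err_mean = np.linalg.norm(theta_mean - theta_star)
print(f"median-of-means error: {err_robust:.4f}")
print(f"plain averaging error: {err_mean:.4f}")
```

In this setup three of the ten batches contain a compromised worker, so a majority of batch means stay close to the true gradient and the geometric median stays near them. Running the same loop with k = 1 shows how plain averaging behaves under the same attack.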

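Results

Research Questions

RQ1: Can a distributed learning algorithm tolerate Byzantine (arbitrary) failures while each worker uses only its local data?
RQ2: Which aggregation rules can combine gradients robustly enough to limit Byzantine influence without breaking convergence?
RQ3: What convergence rate and statistical error bound can distributed learning achieve under Byzantine failures?
RQ4: How should the system parameters (k, q, m, N, d) be chosen to balance fault tolerance against statistical accuracy?
RQ5: How does the method apply to concrete problems such as linear regression?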

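Key Findings

For any fixed ε > 0, the proposed Byzantine Gradient Descent method tolerates up to q Byzantine failures, provided 2(1+ε)q ≤ m.
The estimator converges in O(log N) rounds with an error bound of max{√(dq/N), √(d/N)}.
In the Byzantine setting, the error matches the minimax-optimal failure-free rate √(d/N) up to a factor of at most √q.
The total computation cost is O((Nd/m) log N) at each working machine and O(md + kd log^3 N) at the parameter server (k is the number of batches); the communication cost is O(md log N).
For linear regression, the framework is shown to apply and to be robust to adversarial workers.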
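To make the √q factor concrete, the rates can be evaluated for one hypothetical configuration, d = 100, N = 10^6, and q = 4 (illustrative numbers, not from the paper). Constants hidden by the order notation are ignored, so only the ratios are meaningful.

```latex
\max\Bigl\{\sqrt{\tfrac{dq}{N}},\ \sqrt{\tfrac{d}{N}}\Bigr\}
  = \sqrt{\tfrac{d}{N}}\cdot\max\{\sqrt{q},\,1\},
\qquad
\sqrt{\tfrac{d}{N}} = \sqrt{\tfrac{100}{10^{6}}} = 0.01,
\quad
\sqrt{\tfrac{dq}{N}} = \sqrt{\tfrac{400}{10^{6}}} = 0.02,
\quad
\sqrt{\tfrac{d(2q+1)}{N}} = \sqrt{\tfrac{900}{10^{6}}} = 0.03 .
```

With four Byzantine workers the bound in the findings is twice the failure-free rate, and the abstract's form √(d(2q+1)/N) is three times that rate. Both reduce to the optimal √(d/N) when q = 0.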

This review was generated by AI and reviewed by human editors.