QUICK REVIEW

[Paper Digest] Throughput-Optimal Topology Design for Cross-Silo Federated Learning

Othmane Marfoq, Chuan Xu | arXiv (Cornell University) | Oct 23, 2020

Privacy-Preserving Technologies in Data | References: 99 | Cited by: 50

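One-Sentence Summary

This paper designs communication topologies for cross-silo federated learning that maximize system throughput, using the theory of max-plus linear systems, and reports significant training speedups over server-centric and MATCHA-based approaches.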

ABSTRACT

Federated learning usually employs a client-server architecture where an orchestrator iteratively aggregates model updates from remote clients and pushes them back a refined model. This approach may be inefficient in cross-silo settings, as close-by data silos with high-speed access links may exchange information faster than with the orchestrator, and the orchestrator may become a communication bottleneck. In this paper we define the problem of topology design for cross-silo federated learning using the theory of max-plus linear systems to compute the system throughput---number of communication rounds per time unit. We also propose practical algorithms that, under the knowledge of measurable network characteristics, find a topology with the largest throughput or with provable throughput guarantees. In realistic Internet networks with 10 Gbps access links for silos, our algorithms speed up training by a factor 9 and 1.5 in comparison to the master-slave architecture and to state-of-the-art MATCHA, respectively. Speedups are even larger with slower access links.

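Research Motivation and Objectives

Motivation: Improve federated learning efficiency by exploiting fast links between silos.
Objective: Design a communication topology that maximizes training throughput (rounds per unit time) while respecting overlay connectivity constraints.
Method: Incorporate network measurements into topology design to minimize the cycle time of the max-plus system.
Results: Provide algorithms with optimality or approximation guarantees, and validate the speedups on real network topologies.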

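Proposed Method

Models the training process as synchronous DPASGD with local updates and communication between neighbors.
Defines the delay of an overlay edge as d_o(i,j)=s·T_c(i)+l(i,j)+M/A(i′,j′), expressed in terms of the underlay, connectivity, and overlay graphs.
Derives the cycle time through max-plus algebra: τ(G_o)=max_γ d_o(γ)/|γ|, where γ ranges over the circuits of G_o, d_o(γ) is the total delay along γ, and |γ| is its number of edges; the throughput is 1/τ(G_o).
Proposes topology design algorithms for the MCT problem under edge-capacitated and node-capacitated settings.
Gives approximation and optimality results: a minimum spanning tree computed with Prim's algorithm for edge-capacitated undirected overlays; a 3N-approximation via Christofides' algorithm for Euclidean edge-capacitated graphs; a 6-approximation for certain node-capacitated Euclidean cases; and NP-hardness results for directed overlays.
Presents empirical performance comparisons against STAR and MATCHA/MATCHA+ overlays.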
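
The delay formula and the cycle-time definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dict-based graph representation, and the numeric values are all hypothetical. The maximum cycle mean is computed with Karp's algorithm, a standard technique chosen here for brevity.

```python
from math import inf
from typing import Dict, Tuple

Edge = Tuple[int, int]  # directed overlay edge (i, j)


def overlay_delay(s: int, t_compute: float, latency: float,
                  model_size: float, rate: float) -> float:
    """d_o(i,j) = s*T_c(i) + l(i,j) + M/A(i',j')."""
    return s * t_compute + latency + model_size / rate


def cycle_time(n: int, delays: Dict[Edge, float]) -> float:
    """Maximum cycle mean of a directed graph on nodes 0..n-1 (Karp's algorithm)."""
    # best[k][v]: largest total delay of any walk with exactly k edges ending at v.
    best = [[-inf] * n for _ in range(n + 1)]
    best[0] = [0.0] * n
    for k in range(1, n + 1):
        for (i, j), d in delays.items():
            if best[k - 1][i] > -inf:
                best[k][j] = max(best[k][j], best[k - 1][i] + d)
    tau = -inf
    for v in range(n):
        if best[n][v] == -inf:
            continue  # no walk of n edges ends at v
        worst = min((best[n][v] - best[k][v]) / (n - k)
                    for k in range(n) if best[k][v] > -inf)
        tau = max(tau, worst)
    if tau == -inf:
        raise ValueError("graph has no circuit, so the cycle time is undefined")
    return tau


# Hypothetical 4-silo directed ring: 1 local step of 0.1 s, 20 ms latency,
# a 1 Gbit model, and 10 Gbit/s available bandwidth on every edge.
d = overlay_delay(s=1, t_compute=0.1, latency=0.02, model_size=1e9, rate=1e10)
ring = {(i, (i + 1) % 4): d for i in range(4)}
tau = cycle_time(4, ring)
throughput = 1.0 / tau  # communication rounds per second
```

Self-loops such as (i, i) can be added to `delays` if the local computation time should count as a circuit of its own.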

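Experimental Results

Research Questions

RQ1: How can an overlay G_o be designed within the connectivity graph G_c to minimize cycle time and maximize throughput in cross-silo FL?
RQ2: What algorithmic guarantees (optimal or approximate) does MCT admit for edge-capacitated versus node-capacitated settings, and for undirected versus directed overlays?
RQ3: How does the proposed topology design affect training time and convergence when underlay latency, computation time, and queuing are taken into account?
RQ4: In realistic networks, do throughput-oriented topologies yield faster wall-clock training than server-centric or spectrally optimized overlays?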
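
For the edge-capacitated undirected case raised in RQ2, the method summary points to a minimum spanning tree built with Prim's algorithm. The sketch below shows that construction on a small connectivity graph. The function name and the edge weights are hypothetical, and treating each weight as a symmetric per-link delay is an assumption of this example rather than a detail taken from the paper.

```python
import heapq
from typing import Dict, List, Tuple


def prim_mst(n: int, weights: Dict[Tuple[int, int], float]) -> List[Tuple[int, int]]:
    """Return the edges of a minimum spanning tree over nodes 0..n-1."""
    # Undirected adjacency lists storing (weight, from, to) triples.
    adj: Dict[int, list] = {v: [] for v in range(n)}
    for (i, j), w in weights.items():
        adj[i].append((w, i, j))
        adj[j].append((w, j, i))

    visited = {0}              # grow the tree from node 0
    heap = list(adj[0])
    heapq.heapify(heap)
    tree: List[Tuple[int, int]] = []

    while heap and len(visited) < n:
        w, i, j = heapq.heappop(heap)   # cheapest edge leaving the tree
        if j in visited:
            continue
        visited.add(j)
        tree.append((i, j))
        for edge in adj[j]:
            if edge[2] not in visited:
                heapq.heappush(heap, edge)

    if len(visited) < n:
        raise ValueError("connectivity graph is not connected")
    return tree


# Hypothetical symmetric delays (seconds) between four silos.
delays = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0, (2, 3): 1.5, (1, 3): 4.0}
overlay = prim_mst(4, delays)
```

The returned edge list is an undirected overlay; each edge would be used in both directions during training.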

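Key Findings

Overlays designed to maximize throughput train faster than STAR across multiple networks and usually outperform MATCHA/MATCHA+.
The RING, MST, and δ-MBST topologies achieve significant cycle-time reductions; in slow-access-link scenarios, RING can be up to 2N times faster than STAR.
In the iNaturalist experiments, overlays designed from underlay and connectivity data substantially reduce the cycle time, which translates into substantial wall-clock speedups.
With slow access links, low-degree overlays (such as RING, MST, and δ-MBST) outperform higher-degree overlays because their per-round delay is lower.
MATCHA+ can outperform some baselines but requires knowledge of the underlay; throughput-first designs that make no underlay assumptions still perform better in practice.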
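
The degree effect described above can be illustrated with a toy calculation. The sketch assumes that every silo has the same access-link capacity, that a silo splits this capacity equally among its overlay neighbors, and that latency and computation time are zero. These are simplifying assumptions of this example, and all names and numbers are hypothetical. It is not meant to reproduce the 2N figure, which depends on the paper's own STAR model.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # directed overlay edge (sender, receiver)


def transfer_delays(edges: List[Edge], model_size: float,
                    capacity: float) -> Dict[Edge, float]:
    """Per-edge transfer time when access capacity is shared equally (assumption)."""
    out_deg = Counter(i for i, _ in edges)
    in_deg = Counter(j for _, j in edges)
    delays = {}
    for i, j in edges:
        # The slower of the sender's uplink share and the receiver's downlink share.
        rate = min(capacity / out_deg[i], capacity / in_deg[j])
        delays[(i, j)] = model_size / rate
    return delays


def star(n: int) -> List[Edge]:
    """Node 0 is the hub; every other node exchanges models with it."""
    return [(0, v) for v in range(1, n)] + [(v, 0) for v in range(1, n)]


def ring(n: int) -> List[Edge]:
    """Directed ring: node i sends to node (i + 1) mod n."""
    return [(i, (i + 1) % n) for i in range(n)]


n = 10
star_delay = max(transfer_delays(star(n), model_size=1.0, capacity=1.0).values())
ring_delay = max(transfer_delays(ring(n), model_size=1.0, capacity=1.0).values())
ratio = star_delay / ring_delay
```

In this uniform setting every edge on a circuit has the same delay, so the slowest edge equals the cycle mean; the ratio therefore grows with the hub's degree, which is the intuition behind preferring low-degree overlays when access links are slow.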


This digest was generated by AI and reviewed by human editors.