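[Paper Explained] "Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
Short-Dot introduces a coding-theory-inspired method that computes large linear transforms in distributed systems using many short, sparse dot products, so that the outputs of any K processors suffice to recover Ax, which mitigates the impact of stragglers.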
Faced with saturation of Moore's law and increasing dimension of data, system designers have increasingly resorted to parallel and distributed computing. However, distributed computing is often bottlenecked by a small fraction of slow processors called "stragglers" that reduce the speed of computation because the fusion node has to wait for all processors to finish. To combat the effect of stragglers, recent literature introduces redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique, which we call "Short-Dot", to introduce redundant computations in a coding-theory-inspired fashion, for computing linear transforms of long vectors. Instead of computing long dot products as required in the original linear transform, we construct a larger number of redundant and short dot products that can be computed faster and more efficiently at individual processors. Compared with similar schemes that introduce redundancy to tackle stragglers, Short-Dot reduces the cost of computation, storage and communication since shorter portions are stored and computed at each processor, and also shorter portions of the input are communicated to each processor. We demonstrate through probabilistic analysis as well as experiments that Short-Dot offers significant speed-up compared to existing techniques. We also derive trade-offs between the length of the dot products and the resilience to stragglers (number of processors to wait for), for any such strategy and compare it to that achieved by our strategy.
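Research motivation and goals
- Speed up high-dimensional linear transforms when stragglers introduce delays.
- Develop a coding strategy that shortens the dot product at each processor while keeping Ax recoverable.
- Characterize the fundamental trade-off between dot-product length and straggler tolerance.
- Provide analytical and empirical results showing performance gains over existing schemes.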
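Proposed method
- Construct a P × N matrix F such that any K rows can be linearly combined to recover the M rows of A, while every row of F is sparse, with at most s = (N/P)(P−K+M) nonzero entries.
- Encode F offline from A using a matrix B and additional vectors, chosen to enforce the sparsity pattern and the recovery property.
- Distribute the short dot products to the P processors; each processor computes the dot product of its row of F with the entries of x selected by that row's sparsity pattern.
- The fusion node takes the first K responses and recovers Ax through a linear combination determined by the corresponding rows of B.
- Derive theoretical bounds that show the limits of sparsity and near-optimality for large N, with comparisons to MDS and repetition strategies.
- Analyze computation time under a shifted-exponential model to compare Short-Dot with uncoded, repetition, and MDS schemes.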
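The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: B is drawn as a random Gaussian matrix, the zeros are placed in a cyclic pattern (column j is zeroed at K−M consecutive processors), P is assumed to divide N so that every row ends up with the same number of zeros, and the function names `short_dot_encode`, `worker_output`, and `short_dot_decode` are hypothetical.

```python
import numpy as np

def short_dot_encode(A, P, K, rng):
    """Return (F, B): F is P x N with sparse rows, B is the P x K mixing matrix."""
    M, N = A.shape
    assert M <= K <= P
    # Assumption: random Gaussian B; the submatrices used below are
    # invertible with probability 1.
    B = rng.standard_normal((P, K))
    Z = np.zeros((K - M, N))  # the "additional vectors", stacked below A
    zero_mask = np.zeros((P, N), dtype=bool)
    for j in range(N):
        # Assumption: cyclic pattern, column j is zero at K - M consecutive processors.
        rows = [(j + t) % P for t in range(K - M)]
        zero_mask[rows, j] = True
        if rows:
            # Solve B[rows, :M] A[:, j] + B[rows, M:] Z[:, j] = 0 for Z[:, j].
            Z[:, j] = -np.linalg.solve(B[rows][:, M:], B[rows][:, :M] @ A[:, j])
    F = B @ np.vstack([A, Z])
    F[zero_mask] = 0.0  # clear floating-point residue at the forced zeros
    return F, B

def worker_output(F, i, x):
    """Processor i computes a short dot product over its support only."""
    support = np.flatnonzero(F[i])  # the only entries of x this processor needs
    return F[i, support] @ x[support]

def short_dot_decode(B, rows, outputs, M):
    """Recover A x from the outputs of the K processors listed in `rows`."""
    # outputs = B[rows] @ ([A; Z] @ x), so solving the K x K system gives [A; Z] x.
    stacked = np.linalg.solve(B[rows], outputs)
    return stacked[:M]  # the first M entries are A x

rng = np.random.default_rng(0)
M, N, P, K = 2, 12, 6, 4
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)
F, B = short_dot_encode(A, P, K, rng)
s = N * (P - K + M) // P  # sparsity bound (N/P)(P-K+M)
print("nonzeros per row:", np.count_nonzero(F, axis=1), "bound s =", s)

rows = [5, 0, 3, 1]  # pretend these K processors responded first
y = np.array([worker_output(F, i, x) for i in rows])
print("recovered A x:", np.allclose(short_dot_decode(B, rows, y, M), A @ x))
```

With these toy sizes each processor stores a row with at most s = 8 nonzeros instead of a dense length-12 row, and the fusion node can decode from whichever K = 4 processors finish first.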
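Experimental results
Research questions
- RQ1: Can an arbitrary Ax be recovered from K short, sparse dot products while the sparsity at each processor is controlled?
- RQ2: What is the fundamental trade-off between dot-product length and the number of processors to wait for (K)?
- RQ3: How does Short-Dot perform relative to uncoded, repetition, and MDS-based strategies in the presence of stragglers?
- RQ4: Under what conditions is Short-Dot near-optimal in terms of sparsity and robustness?
- RQ5: What is Short-Dot's gain in expected computation time at large scale?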
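Main findings
- Short-Dot achieves a per-processor dot-product sparsity of s = (N/P)(P−K+M) while ensuring that the outputs for any K rows of F suffice to recover Ax.
- The paper proves the existence of an F with the required properties and derives a lower bound on the average sparsity; for large N and M > 1, Short-Dot attains near-optimal sparsity.
- Under the shifted-exponential straggler model, Short-Dot has a lower expected computation time than uncoded, repetition, and MDS strategies in key regimes, including M = Θ(P) and some sublinear cases.
- Short-Dot can offer asymptotically faster computation, with gains that scale with log(P) or with a P-dependent factor, depending on how M relates to P.
- The method reduces the storage and communication load at each processor, because each dot product is shorter and only a subset of the input is transmitted.
- Experiments show that Short-Dot outperforms existing strategies in straggler-prone environments.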
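To make the trade-off between dot-product length and the number of processors to wait for more concrete, the sketch below estimates the waiting time of a "fastest K out of P" scheme. It rests on an assumed parameterization of the shifted-exponential model, which may differ from the paper's exact definition: a processor needs time l·(1 + X/μ) for a dot product of length l, where X is a unit-rate exponential and μ is a hypothetical straggling parameter. The helper names `expected_wait` and `simulated_wait` are illustrative, and the sketch does not reproduce the paper's analysis.

```python
import numpy as np

def expected_wait(length, P, K, mu):
    """Closed-form E[time until K of P processors finish] under the assumed model."""
    # E[K-th smallest of P i.i.d. Exp(1)] = sum_{i=P-K+1}^{P} 1/i
    kth = sum(1.0 / i for i in range(P - K + 1, P + 1))
    return length * (1.0 + kth / mu)

def simulated_wait(length, P, K, mu, trials, rng):
    """Monte Carlo estimate of the same quantity."""
    x = rng.exponential(1.0, size=(trials, P))
    times = length * (1.0 + x / mu)       # per-processor finishing times
    kth_fastest = np.sort(times, axis=1)[:, K - 1]
    return kth_fastest.mean()

rng = np.random.default_rng(1)
N, P, M, K, mu = 1200, 20, 10, 15, 1.0
s = N * (P - K + M) // P                  # Short-Dot row length (N/P)(P-K+M)

# Assumption: the MDS-style baseline uses length-N dot products and waits for M of P;
# Short-Dot uses length-s dot products and waits for K of P.
for name, length, wait_for in [("MDS-style", N, M), ("Short-Dot", s, K)]:
    exact = expected_wait(length, P, wait_for, mu)
    approx = simulated_wait(length, P, wait_for, mu, 20000, rng)
    print(f"{name}: closed form {exact:.1f}, simulated {approx:.1f}")
```

Raising K shortens each dot product (smaller s) but forces the fusion node to wait for more processors. Under this assumed model, which scheme comes out ahead depends on μ and on the choice of K, so the parameters are worth sweeping rather than reading anything into a single setting.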
This explainer was generated by AI and reviewed by a human editor.