QUICK REVIEW

[Paper Review] numpywren: serverless linear algebra

Vaishaal Shankar, Karl Krauth | arXiv (Cornell University) | Oct 23, 2018

Cloud Computing and Resource Management | References: 28 | Cited by: 67

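ONE-SENTENCE SUMMARY

numpywren runs large-scale linear algebra on serverless compute, coming close to ScaLAPACK's completion time on key algorithms while using markedly fewer total CPU-hours, and it exposes how the lack of locality in serverless runtimes limits network efficiency.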

ABSTRACT

Linear algebra operations are widely used in scientific computing and machine learning applications. However, it is challenging for scientists and data analysts to run linear algebra at scales beyond a single machine. Traditional approaches either require access to supercomputing clusters, or impose configuration and cluster management challenges. In this paper we show how the disaggregation of storage and compute resources in so-called "serverless" environments, combined with compute-intensive workload characteristics, can be exploited to achieve elastic scalability and ease of management. We present numpywren, a system for linear algebra built on a serverless architecture. We also introduce LAmbdaPACK, a domain-specific language designed to implement highly parallel linear algebra algorithms in a serverless setting. We show that, for certain linear algebra algorithms such as matrix multiply, singular value decomposition, and Cholesky decomposition, numpywren's performance (completion time) is within 33% of ScaLAPACK, and its compute efficiency (total CPU-hours) is up to 240% better due to elasticity, while providing an easier to use interface and better fault tolerance. At the same time, we show that the inability of serverless runtimes to exploit locality across the cores in a machine fundamentally limits their network efficiency, which limits performance on other algorithms such as QR factorization. This highlights how cloud providers could better support these types of computations through small changes in their infrastructure.

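MOTIVATION AND OBJECTIVES

Establish the need for linear algebra that scales beyond a single machine, and reduce the complexity of cluster capacity planning.
Propose a serverless architecture that separates storage from compute so that linear algebra workloads can scale elastically.
Introduce LAmbdaPACK, a domain-specific language for expressing parallel linear algebra over tiled matrices in a stateless setting.
Demonstrate performance and fault-tolerance advantages over traditional high-performance computing libraries and fault-tolerant data-parallel systems.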

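PROPOSED METHOD

Build a serverless system (numpywren) that runs linear algebra tasks as stateless functions and keeps intermediate state in a distributed object store.
Introduce LAmbdaPACK, a domain-specific language that expresses tiled linear algebra algorithms as DAG-like dependency graphs.
Use decentralized dependency analysis to generate an executable task graph from a LAmbdaPACK program.
Implement a lease-based fault-tolerant execution model and an elastic scheduler to manage workers.
Evaluate end-to-end performance on GEMM, QR, SVD, and Cholesky against ScaLAPACK and Dask.
Discuss the limitations that follow from serverless runtimes being unable to exploit locality, along with potential infrastructure changes.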
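The lease-based execution model in the list above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not numpywren's implementation: `LeaseQueue`, `run`, and `crash_on` are hypothetical names, the object store is a plain dict, and time is a simulated counter. It assumes a lease behaves like a visibility timeout: a leased task is hidden from other workers until the lease expires, and it is removed only after its result has been written to storage.

```python
class LeaseQueue:
    """Toy in-memory task queue with leases (hypothetical, for illustration only)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.pending = {}  # task_id -> time at which the task is visible again

    def put(self, task_id):
        self.pending[task_id] = 0.0  # visible immediately

    def lease(self, now):
        """Return a visible task and hide it until its lease expires, else None."""
        for task_id, visible_at in self.pending.items():
            if visible_at <= now:
                self.pending[task_id] = now + self.lease_seconds
                return task_id
        return None

    def complete(self, task_id):
        self.pending.pop(task_id, None)


def run(queue, object_store, tasks, crash_on):
    """Drain the queue on a simulated clock.

    `crash_on` holds (task_id, attempt) pairs where the worker dies before
    writing its result; the task reappears once the lease expires.
    """
    attempts = {}
    now = 0.0
    while queue.pending:
        task_id = queue.lease(now)
        if task_id is None:
            now += 1.0  # nothing visible yet: wait for a lease to expire
            continue
        attempts[task_id] = attempts.get(task_id, 0) + 1
        if (task_id, attempts[task_id]) in crash_on:
            continue  # simulated worker failure: no output, no completion
        # Stateless task: the only lasting effect is the write to storage.
        object_store[task_id] = tasks[task_id]()
        queue.complete(task_id)
    return attempts


tasks = {"a": lambda: 1 + 1, "b": lambda: 2 * 3}
queue = LeaseQueue(lease_seconds=3.0)
for name in tasks:
    queue.put(name)
store = {}
attempts = run(queue, store, tasks, crash_on={("a", 1)})
```

In this run the first attempt at task "a" is lost, the lease expires, and a second attempt succeeds. Because each task is stateless and writes its result only to storage, re-running it is safe, which is the property that lets a lease stand in for explicit failure detection.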

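EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can serverless runtimes execute large-scale linear algebra efficiently when storage is disaggregated from compute?
RQ2: How close can serverless linear algebra come to traditional HPC libraries in completion time and CPU-hours?
RQ3: What trade-offs does the stateless task design impose on network traffic and fault tolerance for linear algebra algorithms?
RQ4: How does LAmbdaPACK achieve compact representation and scalable scheduling of complex linear algebra DAGs?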

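KEY FINDINGS

For matrix multiply, SVD, and Cholesky decomposition, numpywren's completion time is within 33% of ScaLAPACK's.
Thanks to elasticity, compute efficiency (total CPU-hours) improves by up to 240%.
For Cholesky on a 1M x 1M matrix, numpywren's completion time is within 36% of ScaLAPACK's, and it can use 33% fewer CPU-hours.
Compared with fault-tolerant data-parallel systems such as Dask, numpywren is up to 320% faster.
The inability of serverless runtimes to exploit locality reduces network efficiency for some algorithms, such as QR factorization, which points to opportunities for cloud providers in infrastructure design.
LAmbdaPACK yields compact representations of large DAGs (about 2 KB for millions of nodes) and supports key algorithms such as Cholesky, TSQR, LU, and SVD.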
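The compactness finding can be made concrete with tiled Cholesky: the loop nest that defines the algorithm is a few lines long, while the number of tasks it generates grows cubically with the number of tile rows. The following is a hedged Python sketch, not LAmbdaPACK syntax. The function names `cholesky_tasks`, `parents`, and `task_count` are hypothetical, and the kernel labels `chol`, `trsm`, and `update` assume the standard right-looking tiled formulation.

```python
def cholesky_tasks(n_blocks):
    """Yield (kernel, tile indices) for a right-looking tiled Cholesky."""
    for i in range(n_blocks):
        yield ("chol", (i, i))                  # factor diagonal tile (i, i)
        for j in range(i + 1, n_blocks):
            yield ("trsm", (j, i))              # solve for panel tile (j, i)
        for j in range(i + 1, n_blocks):
            for k in range(i + 1, j + 1):
                yield ("update", (j, k, i))     # S[j,k] -= L[j,i] @ L[k,i].T


def parents(task):
    """Compute one task's dependencies from its indices alone.

    No global graph is built, so any worker can answer this question
    locally for the task it holds.
    """
    kernel, idx = task
    if kernel == "chol":
        i, _ = idx
        deps = [("update", (i, i, i - 1))] if i > 0 else []
    elif kernel == "trsm":
        j, i = idx
        deps = [("chol", (i, i))]
        if i > 0:
            deps.append(("update", (j, i, i - 1)))
    else:
        j, k, i = idx
        deps = [("trsm", (j, i)), ("trsm", (k, i))]
        if i > 0:
            deps.append(("update", (j, k, i - 1)))
    return list(dict.fromkeys(deps))  # drop the duplicate when j == k


def task_count(n):
    """Closed-form number of tasks for n tile rows."""
    return n + n * (n - 1) // 2 + (n - 1) * n * (n + 1) // 6


n_tasks_small = sum(1 for _ in cholesky_tasks(4))
n_tasks_large = task_count(256)
```

The program text stays constant in size while `task_count` grows on the order of n³/6, which is why a representation of a few kilobytes can stand for a DAG with millions of nodes. Deriving `parents` from loop indices instead of a stored edge list is one way to read the "decentralized dependency analysis" described earlier; the paper's actual analysis may differ in detail.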


This review was generated by AI and checked by a human editor.