QUICK REVIEW

[论文解读] GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Carl Yang, Aydın Buluç|arXiv (Cornell University)|Aug 4, 2019

Graph Theory and Algorithms被引用 23

一句话总结

GraphBLAST 是一种基于 GPU 的高性能图处理框架，利用线性代数原语加速图分析。通过使用 CUSP 库将图操作表示为稀疏矩阵计算并优化内存访问模式，其在大规模图上的性能相比最先进的 GPU 图框架最高可提升 12.5 倍。

ABSTRACT

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.

研究动机与目标

解决基于 GPU 的图处理中因不规则内存访问和负载不平衡导致的性能瓶颈。
通过将图操作映射到高度优化的线性代数内核，实现高吞吐量的图分析。
通过利用现代 GPU 的并行性，缩小 CPU 和 GPU 图处理之间的性能差距。
通过优化内存访问模式和内核融合，实现迭代图算法（如 PageRank 和 SSSP）的低延迟执行。

提出的方法

将图表示为稀疏邻接矩阵，并将图算法表达为稀疏矩阵-向量乘法（SpMV）。
利用 CUSP 库，借助 SpMV 和其他线性代数操作的高度优化 GPU 内核。
应用矩阵重排序技术，以改善内存合并并减少内存访问模式的延迟。
通过内核融合最小化内核启动开销，并减少设备内存与寄存器之间的数据移动。
使用压缩稀疏行（CSR）格式并结合合并内存访问模式，优化数据布局。
通过将图算法抽象为可重用的线性代数原语，支持多种图算法（例如 PageRank、SSSP）。

实验结果

研究问题

RQ1图算法能否在 GPU 上通过标准线性代数原语表达并加速？
RQ2基于线性代数的图框架性能与手工优化的 GPU 图框架相比如何？
RQ3在基于 GPU 的图处理中，内存访问模式和内核启动开销能在多大程度上被优化？
RQ4将图工作负载映射到高度调优的线性代数内核时，可实现的最大性能提升是多少？
RQ5与现有 GPU 图框架相比，GraphBLAST 在不同图规模和密度下的可扩展性如何？

主要发现

在大规模真实世界图上，GraphBLAST 相比最佳性能的 GPU 图框架最高可实现 12.5 倍的性能提升。
通过内核融合和高效的内存访问模式，框架将内核启动开销降低了 70%。
通过矩阵重排序实现的内存合并，在现代 GPU 架构上可将带宽利用率最高提升 40%。
GraphBLAST 在多种图工作负载（包括稀疏图和密集图）中均表现出一致的高性能。
线性代数抽象支持图算法的快速原型设计，并实现跨不同 GPU 平台的可移植性。
与基线 GPU 实现相比，GraphBLAST 在大规模社交网络图上将 PageRank 和 SSSP 的执行时间最多减少了 10 倍。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。