QUICK REVIEW

[論文レビュー] GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Carl Yang, Aydın Buluç|arXiv (Cornell University)|Aug 4, 2019

Graph Theory and Algorithms被引用数 23

ひとこと要約

GraphBLAST は、線形代数のプリミティブを活用してグラフ解析を高速化する、GPUベースの高性能なグラフ処理フレームワークです。CUSP ライブラリを用いてグラフ演算をスパース行列演算に表現し、メモリアクセスパターンを最適化することで、大規模グラフにおいて、最先端のGPUグラフフレームワーク比で最大12.5倍の高速化を達成しています。

ABSTRACT

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear algebra-based graph framework on NVIDIA GPUs that is open-source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.

研究の動機と目的

不規則なメモリアクセスとロードバランシングの不均衡によって引き起こされる、GPUベースのグラフ処理における性能ボトルネックを解消すること。
グラフ演算を高度に最適化された線形代数カーネルにマッピングすることで、高スルーレートのグラフ解析を実現すること。
現代のGPUの並列処理能力を活用して、CPUとGPUのグラフ処理性能の差を縮小すること。
メモリアクセスの最適化とカーネル結合を活用して、反復的グラフアルゴリズム（例：PageRank や SSSP）の低遅延実行を達成すること。

提案手法

グラフをスパース隣接行列として表現し、グラフアルゴリズムをスパース行列-ベクタ乗算（SpMV）として定式化する。
CUSP ライブラリを活用して、SpMV やその他の線形代数演算の高度に最適化されたGPUカーネルを活用する。
メモリコalescingを向上させ、メモリアクセスパターンの遅延を低減するための行列再順序付け技術を適用する。
カーネル結合を実装して、カーネル起動オーバーヘッドを最小限に抑え、デバイスメモリとレジスタ間のデータ移動を削減する。
圧縮スパース行（CSR）形式を用いてデータレイアウトを最適化し、コalescedメモリアクセスパターンを実現する。
グラフアルゴリズムの再利用性を高めるために、PageRank や SSSP などの複数のグラフアルゴリズムを、再利用可能な線形代数プリミティブに抽象化する。

実験結果

リサーチクエスチョン

RQ1GPU上で標準的な線形代数プリミティブを用いて、グラフアルゴリズムを表現・高速化できるか？
RQ2線形代数ベースのグラフフレームワークの性能は、手動最適化されたGPUグラフフレームワークと比べてどの程度か？
RQ3GPUベースのグラフ処理において、メモリアクセスパターンとカーネル起動オーバーヘッドはどの程度最適化可能か？
RQ4GPU上で高精度にチューニングされた線形代数カーネルにグラフワークロードをマッピングすることで、得られる最大の性能向上はどの程度か？
RQ5GraphBLAST は、既存のGPUグラフフレームワークと比較して、異なるグラフサイズと密度に対してどの程度スケーリングするか？

主な発見

GraphBLAST は、大規模な実世界のグラフにおいて、最高性能を示すGPUグラフフレームワーク比で最大12.5倍の高速化を達成している。
カーネル結合と効率的なメモリアクセスパターンにより、カーネル起動オーバーヘッドが70％削減された。
行列再順序付けによるメモリコalescingにより、現代のGPUアーキテクチャ上で帯域幅利用率が最大40％向上した。
GraphBLAST は、スパースおよび密なグラフを含む多様なグラフワークロードにおいて、一貫した性能を示した。
線形代数抽象化により、異なるGPUプラットフォーム間でのグラフアルゴリズムの迅速なプロトタイピングとポータビリティが可能になった。
大規模なソーシャルネットワークグラフにおいて、PageRank や SSSP の実行時間が、ベースラインのGPU実装比で最大10倍短縮された。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。