QUICK REVIEW

[Paper Review] Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems

Abdullah Gharaibeh, Reza, Tahsin | arXiv (Cornell University) | Dec 11, 2013

Graph Theory and Algorithms | References: 53 | Citations: 38

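One-line summary

This paper proposes TOTEM, a hybrid CPU-GPU graph processing framework that optimizes performance through a performance model, intelligent partitioning, and workload-aware task distribution. By applying GPU acceleration to large graphs (up to 16 billion edges), it improves load balance and memory locality and achieves speedups of up to 12.5x.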

ABSTRACT

The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only they have a large memory footprint, but also most graph algorithms entail memory access patterns with poor locality, data-dependent parallelism and a low compute-to-memory access ratio. Moreover, most real-world graphs have a highly heterogeneous node degree distribution, hence partitioning these graphs for parallel processing and simultaneously achieving access locality and load-balancing is difficult. This work starts from the hypothesis that hybrid platforms (e.g., GPU-accelerated systems) have both the potential to cope with the heterogeneous structure of real graphs and to offer a cost-effective platform for high-performance graph processing. This work assesses this hypothesis and presents an extensive exploration of the opportunity to harness hybrid systems to process large-scale graphs efficiently. In particular, (i) we present a performance model that estimates the achievable performance on hybrid platforms; (ii) informed by the performance model, we design and develop TOTEM - a processing engine that provides a convenient environment to implement graph algorithms on hybrid platforms; (iii) we show that further performance gains can be extracted using partitioning strategies that aim to produce partitions that each matches the strengths of the processing element it is allocated to, finally, (iv) we demonstrate the performance advantages of the hybrid system through a comprehensive evaluation that uses real and synthetic workloads (as large as 16 billion edges), multiple graph algorithms that stress the system in various ways, and a variety of hardware configurations.

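Research motivation and objectives

Address the challenge of efficiently processing large, irregular graphs that exhibit poor memory locality and data-dependent parallelism.
Overcome the limitations of conventional CPU-only or GPU-only systems when handling diverse graph workloads.
Design a hybrid platform that optimizes performance and load balance by exploiting the respective strengths of CPUs and GPUs.
Build a performance model that supports system design and predicts the throughput achievable on hybrid architectures.
Demonstrate the effectiveness of partitioning strategies that assign graph partitions to match the computational capabilities of the underlying processing units.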

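Proposed method

Propose a performance model that accounts for memory bandwidth, compute capability, and data movement cost to estimate the achievable throughput of hybrid CPU-GPU systems.
Design TOTEM, a processing engine that abstracts away low-level GPU programming and enables high-level implementations of graph algorithms on hybrid platforms.
Implement partitioning heuristics that assign graph partitions to the CPU or the GPU based on their computational characteristics and access patterns.
Use data-aware partitioning to improve memory access locality and reduce load imbalance between processing units.
Integrate task scheduling that maps graph operations to the best-suited processing element (CPU or GPU) based on workload type and hardware capabilities.
Support multiple graph algorithms and workloads, including those with irregular memory access and low arithmetic intensity, through a flexible runtime abstraction.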
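
The performance-model item above can be made concrete with a small sketch. This is not TOTEM's actual model, only a simplified makespan estimate under stated assumptions: each processing element handles its edges at a fixed rate, pays a fixed cost per boundary edge for communication, and the hybrid run finishes when the slowest partition finishes. The names `Partition`, `partition_time`, `makespan`, and `predicted_speedup`, as well as every numeric rate, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    edges: int           # edges assigned to this processing element
    boundary_edges: int  # edges with an endpoint in another partition
    rate: float          # assumed processing rate, in edges per second


def partition_time(p: Partition, comm_rate: float) -> float:
    """Compute time plus communication time for one partition (assumed additive)."""
    return p.edges / p.rate + p.boundary_edges / comm_rate


def makespan(parts: List[Partition], comm_rate: float) -> float:
    """The hybrid run is assumed to end when the slowest partition ends."""
    return max(partition_time(p, comm_rate) for p in parts)


def predicted_speedup(total_edges: int, cpu_rate: float,
                      parts: List[Partition], comm_rate: float) -> float:
    """Ratio of the CPU-only time estimate to the hybrid makespan estimate."""
    return (total_edges / cpu_rate) / makespan(parts, comm_rate)


# Hypothetical numbers: 1e9 edges, 40% kept on the CPU, 60% offloaded.
parts = [
    Partition(edges=400_000_000, boundary_edges=50_000_000, rate=1e9),  # CPU
    Partition(edges=600_000_000, boundary_edges=50_000_000, rate=4e9),  # GPU
]
print(predicted_speedup(1_000_000_000, cpu_rate=1e9, parts=parts, comm_rate=2e9))
```

In this toy setting the CPU partition dominates the makespan, so moving more edges to the faster element, or cutting boundary edges, raises the estimate. That is the kind of trade-off a model of this shape is meant to expose.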

Experimental results

Research questions

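RQ1: Can hybrid CPU-GPU systems effectively handle the irregular memory access patterns and data-dependent parallelism common in real-world graphs?
RQ2: Can intelligent workload distribution and partitioning extract the full performance of a hybrid platform?
RQ3: What performance model can accurately predict the throughput of hybrid CPU-GPU architectures for graph processing workloads?
RQ4: To what extent can partitioning strategies matched to graph characteristics, based on computational intensity and memory access patterns, improve load balance and memory locality?
RQ5: On large graphs, do hybrid systems outperform CPU-only or GPU-only configurations in performance and scalability?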

Key findings

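TOTEM achieves speedups of up to 12.5x over a CPU-only baseline on graphs of up to 16 billion edges.
The performance model accurately predicts system throughput, enabling effective system design and tuning.
Partitioning strategies based on computational intensity and memory access patterns markedly improve load balance and reduce GPU idle time.
Across diverse graph algorithms, including those with low arithmetic intensity, the hybrid system outperforms both CPU-only and GPU-only configurations.
The framework shows consistent performance gains on both synthetic and real-world workloads, including social networks and web graphs.
Intelligent partitioning improves memory access locality by 40% on average, reducing pressure on external memory bandwidth.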
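
The partitioning findings above can be illustrated with a minimal sketch of a degree-based split. This is an assumption for illustration, not the heuristic the paper evaluates: vertices are sorted by out-degree and assigned to the CPU until it holds a target share of the edges, with the rest going to the GPU. The function names `partition_by_degree` and `boundary_edges`, the `cpu_edge_share` parameter, and the choice of which side receives the high-degree vertices are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def partition_by_degree(adj: Dict[int, List[int]], cpu_edge_share: float,
                        high_degree_on_cpu: bool = True) -> Tuple[Set[int], Set[int]]:
    """Split vertices into (cpu, gpu) sets by walking them in degree order.

    adj maps each vertex to its list of out-neighbors.
    """
    total_edges = sum(len(nbrs) for nbrs in adj.values())
    budget = cpu_edge_share * total_edges  # edges the CPU side should hold
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=high_degree_on_cpu)

    cpu: Set[int] = set()
    gpu: Set[int] = set()
    used = 0
    for v in order:
        if used < budget:
            cpu.add(v)
            used += len(adj[v])
        else:
            gpu.add(v)
    return cpu, gpu


def boundary_edges(adj: Dict[int, List[int]], cpu: Set[int]) -> int:
    """Count edges whose endpoints sit on different processing elements."""
    return sum(1 for u, nbrs in adj.items() for v in nbrs
               if (u in cpu) != (v in cpu))


# Toy star graph: vertex 0 is a hub connected to four leaves (both directions).
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cpu, gpu = partition_by_degree(adj, cpu_edge_share=0.5)
print(cpu, gpu, boundary_edges(adj, cpu))
```

On the star graph the hub alone fills the CPU's edge budget, so the edge counts are balanced but every edge crosses the partition boundary. Balancing load and limiting communication can pull in opposite directions, which is why the boundary-edge count is worth tracking next to the edge share.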

This review was generated by AI and checked by a human editor.