QUICK REVIEW

[Paper Review] DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs

Da Zheng, Chao Ma|arXiv (Cornell University)|Oct 11, 2020

Advanced Graph Neural Networks|References: 34|Cited by: 23

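One-Sentence Summary

DistDGL is a distributed GNN training system built on DGL. It achieves efficient, scalable mini-batch training on billion-scale graphs through locality-aware graph partitioning, co-location of data and computation, and multi-constraint load balancing. On a cluster of 16 machines it achieves linear speedup and completes a training epoch on a graph with 100 million nodes and 3 billion edges in 13 seconds, without compromising model accuracy.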

ABSTRACT

Graph neural networks (GNN) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To tackle this challenge, we develop DistDGL, a system for training GNNs in a mini-batch fashion on a cluster of machines. DistDGL is based on the Deep Graph Library (DGL), a popular GNN development framework. DistDGL distributes the graph and its associated data (initial features and embeddings) across the machines and uses this distribution to derive a computational decomposition by following an owner-compute rule. DistDGL follows a synchronous training approach and allows ego-networks forming the mini-batches to include non-local nodes. To minimize the overheads associated with distributed computations, DistDGL uses a high-quality and light-weight min-cut graph partitioning algorithm along with multiple balancing constraints. This allows it to reduce communication overheads and statically balance the computations. It further reduces the communication by replicating halo nodes and by using sparse embedding updates. The combination of these design choices allows DistDGL to train high-quality models while achieving high parallel efficiency and memory scalability. We demonstrate our optimizations on both inductive and transductive GNN models. Our results show that DistDGL achieves linear speedup without compromising model accuracy and requires only 13 seconds to complete a training epoch for a graph with 100 million nodes and 3 billion edges on a cluster with 16 machines. DistDGL is now publicly available as part of DGL: https://github.com/dmlc/dgl/tree/master/python/dgl/distributed.

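Motivation and Objectives

Address the challenge of efficiently training GNNs on billion-scale graphs that exceed the memory and compute capacity of a single machine.
Reduce the communication overhead of distributed GNN training, which is dominated by fetching neighbor-node features because of the dependencies between nodes.
Achieve high parallel efficiency and memory scalability by co-locating graph data, features, and computation across the cluster.
Balance the computational load across machines through multi-constraint graph partitioning and balanced mini-batch generation.
Scale training to large graphs while preserving model accuracy, using synchronous mini-batch training in which ego-networks may include non-local nodes.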

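Proposed Method

Distribute the graph structure, node features, and embeddings across the machines of the cluster, and derive the computational decomposition from this distribution by following an owner-compute rule.
Partition the graph with a METIS-based min-cut algorithm extended with multiple balancing constraints, to minimize cross-partition communication and keep the load balanced.
Replicate halo nodes across partitions to reduce the overhead of repeated remote data fetches during message passing.
Use sparse embedding updates to reduce gradient communication during synchronous SGD training.
Run the sampling servers and the in-memory KVStore servers on the same machines as the trainers, so that data and computation are co-located.
Support both inductive and transductive GNN models through a DGL-compatible API that requires only minimal code changes.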
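
The owner-compute rule and halo-node replication described above can be illustrated with a small sketch. This is not DistDGL code: the function `build_partitions`, the toy edge list, and the `owner` mapping are hypothetical, and the sketch assumes that a cross-partition edge is handled by the partition that owns its destination node. It only tracks node ids; it does not model features or the KVStore.

```python
from collections import defaultdict


def build_partitions(edges, owner):
    """Return owned nodes, halo nodes, and the edge-cut for a partitioning.

    edges: iterable of (src, dst) pairs; messages flow from src to dst.
    owner: dict mapping node id -> partition id (hypothetical input, e.g.
           the output of a min-cut partitioner such as METIS).
    """
    owned = defaultdict(set)
    halo = defaultdict(set)
    for node, part in owner.items():
        owned[part].add(node)

    edge_cut = 0
    for src, dst in edges:
        p_src, p_dst = owner[src], owner[dst]
        if p_src != p_dst:
            edge_cut += 1
            # Assumed owner-compute rule: the partition that owns dst computes
            # dst's embedding, so it keeps a replica of the remote neighbor src.
            halo[p_dst].add(src)
    return dict(owned), dict(halo), edge_cut


# Toy graph with 4 nodes split across 2 partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]
owner = {0: 0, 1: 0, 2: 1, 3: 1}
owned, halo, edge_cut = build_partitions(edges, owner)

print("owned:", owned)
print("halo:", halo)
print("edge cut:", edge_cut)
```

In this toy example three of the five edges cross the partition boundary. A min-cut partitioner tries to make that count small, which in turn shrinks the halo sets and the amount of remote data each partition needs.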

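Experimental Results

Research Questions

RQ1: How can distributed GNN training be made efficient at large scale, especially when the communication overhead of fetching neighbor data dominates?
RQ2: On billion-scale graphs, does min-cut graph partitioning with multi-constraint load balancing significantly improve training performance?
RQ3: To what extent does jointly optimizing communication reduction and load balancing yield linear speedup in distributed GNN training?
RQ4: How does DistDGL preserve model accuracy when scaling to a graph with 100 million nodes and 3 billion edges on 16 machines?
RQ5: Can a system built on DGL provide a seamless migration from single-machine to distributed training with only minimal code changes?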

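Key Findings

DistDGL achieves linear speedup on 16 machines and completes a training epoch on a graph with 100 million nodes and 3 billion edges in only 13 seconds, without sacrificing model accuracy.
On the ogbn-products graph, METIS partitioning with multi-constraint load balancing is 4% faster than default METIS and 2.14x faster than random partitioning.
On the ogbn-papers100M graph, default METIS performs even worse than random partitioning because of severe load imbalance, which highlights the need for multi-constraint load balancing.
Combining the communication reduction from min-cut partitioning with the load balancing from multiple constraints is the key to high performance.
Replicating halo nodes and using sparse embedding updates significantly reduce communication traffic during training.
DistDGL stays compatible with the DGL API, so migrating from single-machine to distributed training requires very little code modification.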
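
To make the load-imbalance finding concrete, the sketch below measures how evenly a partitioning spreads several per-node weights at once. It is an illustration only: the function `imbalance` is hypothetical, and the three constraints used here (node count, training-node count, degree) are assumed for the example rather than taken from the summary. A ratio of 1.0 means a constraint is perfectly balanced; larger values mean one partition carries more than its share.

```python
def imbalance(owner, weights, num_parts):
    """Return max-load / mean-load per constraint for a partitioning.

    owner:   dict mapping node id -> partition id.
    weights: dict mapping node id -> tuple of per-constraint weights
             (assumed here: (node count, is-training-node, degree)).
    """
    num_constraints = len(next(iter(weights.values())))
    loads = [[0.0] * num_constraints for _ in range(num_parts)]
    for node, part in owner.items():
        for c, w in enumerate(weights[node]):
            loads[part][c] += w

    ratios = []
    for c in range(num_constraints):
        column = [loads[p][c] for p in range(num_parts)]
        mean = sum(column) / num_parts
        ratios.append(max(column) / mean if mean else float("inf"))
    return ratios


# Toy data: nodes 0 and 1 are training nodes; node 0 has the highest degree.
weights = {0: (1, 1, 3), 1: (1, 1, 1), 2: (1, 0, 1), 3: (1, 0, 1)}

# Balanced only by node count: both training nodes land on partition 0.
single_constraint = {0: 0, 1: 0, 2: 1, 3: 1}
# Also balanced by training-node count.
multi_constraint = {0: 0, 2: 0, 1: 1, 3: 1}

ratios_single = imbalance(single_constraint, weights, num_parts=2)
ratios_multi = imbalance(multi_constraint, weights, num_parts=2)
print("single-constraint:", ratios_single)
print("multi-constraint: ", ratios_multi)
```

Both partitionings place two nodes on each machine, yet the first puts all training nodes on one partition, so under synchronous training the other machine would mostly wait. Balancing several weights simultaneously is what the multi-constraint formulation is meant to prevent.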

This review was generated by AI and reviewed by human editors.