QUICK REVIEW

[Paper Review] AGL: a Scalable System for Industrial-purpose Graph Machine Learning

Dalong Zhang, Xin Huang|arXiv (Cornell University)|Mar 5, 2020
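Advanced Graph Neural Networks | References: 30 | Cited by: 32
One-sentence summary

AGL is a scalable, fault-tolerant system for industrial graph machine learning. It provides fully functional GNN training and inference built on k-hop neighborhoods and MapReduce, and it runs on commodity clusters.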

ABSTRACT

Machine learning over graphs has been emerging as a powerful learning tool for graph data. However, it is challenging for industrial communities to leverage techniques such as graph neural networks (GNNs) and solve real-world problems at scale because of the inherent data dependency in graphs. As such, we cannot simply train a GNN with classic learning systems, for instance a parameter server that assumes data parallelism. Existing systems store the graph data in memory for fast access, either on a single machine or in remote graph stores. The major drawbacks are threefold. First, they cannot scale because of limits on memory volume or on the bandwidth between graph stores and workers. Second, they require extra development of graph stores without fully exploiting mature infrastructures such as MapReduce that guarantee good system properties. Third, they focus on training but ignore the optimization of inference over graphs, which leaves them as unintegrated systems. In this paper, we design AGL, a scalable, fault-tolerant and integrated system with fully functional training and inference for GNNs. Our system design follows the message passing scheme underlying the computations of GNNs. We generate the $k$-hop neighborhood, an information-complete subgraph for each node, and also perform inference, simply by merging values from in-edge neighbors and propagating values to out-edge neighbors via MapReduce. In addition, because the $k$-hop neighborhood contains an information-complete subgraph for each node, we can simply do the training on parameter servers thanks to data independence. Our system AGL, implemented on mature infrastructures, can finish the training of a 2-layer graph attention network on a graph with billions of nodes and hundreds of billions of edges in 14 hours, and complete the inference in 1.2 hours.

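Research motivation and goals

  • Address the challenge of scaling graph neural networks (GNNs) to industrial graphs with billions of nodes and hundreds of billions of edges.
  • Propose an integrated system design (training plus inference) built on mature infrastructure (MapReduce, parameter servers) to ensure fault tolerance and scalability.
  • Introduce the k-hop neighborhood so that training samples become independent of one another and training on subgraphs is simplified.
  • Develop three core modules (GraphFlat, GraphTrainer, GraphInfer) for scalable neighborhood generation, training, and inference.
  • Improve the efficiency and throughput of training and inference on large-scale attributed graphs.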

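Proposed method

  • Use the message passing paradigm of GNNs to define and generate an information-complete k-hop neighborhood for each target node.
  • Implement GraphFlat, which generates k-hop neighborhoods in a distributed fashion and flattens them into protobuf records stored on a distributed file system.
  • Apply re-indexing, sampling, and inverted indexing in GraphFlat to handle hub nodes and balance load.
  • Develop GraphTrainer as a distributed, parameter-server-style framework with pipelining, pruning, and edge-partitioning optimizations that trains on k-hop neighborhoods using commodity hardware.
  • Design GraphInfer, which runs distributed inference by splitting the model layer by layer and propagating embeddings one layer at a time through a MapReduce-based message passing pipeline.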
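The merge-and-propagate idea behind k-hop neighborhood generation can be sketched in a few lines. This is a minimal single-process illustration, not AGL's GraphFlat implementation: the function name `k_hop_neighborhoods` is hypothetical, plain Python loops stand in for the MapReduce map and reduce phases, and the sampling, re-indexing, and protobuf flattening steps are left out. Each round, every node sends its current subgraph plus the connecting edge along its out-edges, then every node merges what arrived over its in-edges.

```python
from collections import defaultdict


def k_hop_neighborhoods(edges, k):
    """Return {node: frozenset of edges in its k-hop in-neighborhood}.

    edges: iterable of (src, dst) pairs; information flows src -> dst.
    Hypothetical helper for illustration only.
    """
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    # Round 0: a node's 0-hop neighborhood contains no edges.
    hood = {n: frozenset() for n in nodes}

    for _ in range(k):
        # Propagate ("map"): send the current subgraph, plus the edge being
        # traversed, to every out-edge neighbor.
        inbox = defaultdict(list)
        for src in nodes:
            for dst in out_edges[src]:
                inbox[dst].append(hood[src] | {(src, dst)})
        # Merge ("reduce"): each node unions everything that arrived.
        hood = {n: hood[n].union(*inbox[n]) for n in nodes}
    return hood


if __name__ == "__main__":
    chain = [("a", "b"), ("b", "c"), ("c", "d")]
    hoods = k_hop_neighborhoods(chain, k=2)
    print(sorted(hoods["d"]))  # [('b', 'c'), ('c', 'd')]
```

After k rounds each node holds every edge that can influence a k-layer GNN's output for it, so the per-node records can be trained on independently of one another.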

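Experimental results

Key findings

  • AGL trains a 2-layer Graph Attention Network (GAT) on a graph with 6.23e9 nodes and 3.38e11 edges in 14 hours using 100 workers.
  • Inference on the same graph completes in 1.2 hours.
  • Relative to single-machine training (DGL/PyG on a very large machine), the system achieves near-linear speedup on a CPU cluster.
  • GraphFlat, GraphTrainer, and GraphInfer, combined with MapReduce and parameter-server infrastructure, provide fault-tolerant, scalable training and inference for industrial-scale graphs.
  • GraphInfer reuses embeddings across nodes, which improves inference efficiency relative to architectures such as DGL and AliGraph.
  • The approach is one of the largest deployments of graph embeddings in a real industrial setting.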
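The embedding reuse noted in the findings can be illustrated with a toy comparison. This is a sketch under stated assumptions rather than GraphInfer itself: all function names are hypothetical, a simple mean aggregator stands in for a real GNN layer such as GAT, and in-memory loops stand in for the MapReduce pipeline. Layer-wise inference computes each node's embedding once per layer and shares it with all out-edge neighbors, whereas inferring each node separately on its own k-hop neighborhood recomputes the shared intermediate embeddings.

```python
from collections import defaultdict


def layer(self_value, neighbor_values):
    """Stand-in GNN layer: mean of the node's value and its in-neighbors'."""
    values = [self_value] + neighbor_values
    return sum(values) / len(values)


def build_in_neighbors(edges):
    in_neighbors = defaultdict(list)
    for src, dst in edges:
        in_neighbors[dst].append(src)
    return in_neighbors


def layerwise_inference(features, edges, num_layers):
    """Compute all embeddings layer by layer; returns (embeddings, layer calls)."""
    in_neighbors = build_in_neighbors(edges)
    emb = dict(features)
    calls = 0
    for _ in range(num_layers):
        new_emb = {}
        for node in emb:
            # Merge the previous-layer embeddings arriving over in-edges.
            msgs = [emb[src] for src in in_neighbors[node]]
            new_emb[node] = layer(emb[node], msgs)
            calls += 1
        emb = new_emb  # each embedding is computed once and reused
    return emb, calls


def per_node_inference(features, edges, num_layers, target):
    """Recompute the target's embedding from scratch on its own neighborhood."""
    in_neighbors = build_in_neighbors(edges)
    calls = 0

    def embed(node, depth):
        nonlocal calls
        if depth == 0:
            return features[node]
        calls += 1
        msgs = [embed(src, depth - 1) for src in in_neighbors[node]]
        return layer(embed(node, depth - 1), msgs)

    return embed(target, num_layers), calls


if __name__ == "__main__":
    features = {"a": 1.0, "b": 2.0, "c": 3.0}
    edges = [("a", "b"), ("a", "c"), ("b", "c")]
    emb, shared = layerwise_inference(features, edges, num_layers=2)
    separate = sum(per_node_inference(features, edges, 2, n)[1] for n in features)
    print(emb["c"], shared, separate)  # 1.5 6 9
```

Both paths give the same embeddings; only the number of layer evaluations differs, and the gap widens as neighborhoods overlap more.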


This review was generated by AI and checked by human editors.