QUICK REVIEW

[Paper Review] Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Weijie Zhao, Deping Xie|arXiv (Cornell University)|Mar 12, 2020

Advanced Image and Video Retrieval Techniques|References: 49|Cited by: 76

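One-sentence summary

This paper proposes a distributed hierarchical GPU parameter server (HBM-PS, MEM-PS, SSD-PS) for training terabyte-scale sparse CTR models. It outperforms an MPI cluster, with 1.8–4.8x faster training and a 4–9x better price-performance ratio.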

ABSTRACT

Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.
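
The three-layer storage idea in the abstract can be illustrated with a toy lookup path. The sketch below is only an illustration under stated assumptions, not the paper's implementation: the class name `TieredParameterStore`, its methods, and the per-batch staging policy are hypothetical, and plain Python dictionaries stand in for GPU HBM, CPU main memory, and SSD files.

```python
from typing import Dict, Iterable, List


class TieredParameterStore:
    """Toy three-tier store; dicts stand in for HBM, main memory, and SSD."""

    def __init__(self, ssd: Dict[int, List[float]]) -> None:
        self.ssd = ssd                          # slowest tier, holds every parameter
        self.mem: Dict[int, List[float]] = {}   # middle tier (main-memory cache)
        self.hbm: Dict[int, List[float]] = {}   # fastest tier (GPU working set)

    def prepare_batch(self, keys: Iterable[int]) -> None:
        """Stage the parameters referenced by one batch into the faster tiers."""
        self.hbm.clear()  # assumption: the working set is rebuilt for each batch
        for key in set(keys):  # sparse batches repeat keys, so load each key once
            if key not in self.mem:
                self.mem[key] = list(self.ssd[key])  # SSD -> main memory (copy)
            # Main memory -> HBM. The list object is shared here for brevity;
            # a real system would copy the values across the PCIe bus.
            self.hbm[key] = self.mem[key]

    def update(self, key: int, grad: List[float], lr: float = 0.1) -> None:
        """Apply one plain SGD step to a parameter in the HBM working set."""
        vec = self.hbm[key]
        for i, g in enumerate(grad):
            vec[i] -= lr * g

    def flush(self) -> None:
        """Write every cached parameter back to the SSD tier."""
        for key, vec in self.mem.items():
            self.ssd[key] = list(vec)
```

A caller would build the store from an SSD-resident table, call `prepare_batch` with the feature keys of a mini-batch, apply `update` for each gradient, and call `flush` when the cache should be persisted. Only the keys a batch actually references ever leave the slow tier, which is what makes the hierarchy workable for sparse inputs.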

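Research motivation and objectives

Explain the need to train massive CTR models whose parameters exceed the GPU memory and CPU main memory capacity of a single node.
Propose a three-layer hierarchical storage design (HBM, main memory, SSD) that enables GPU-centric training of massive sparse models.
Develop efficient intra-node and inter-node GPU parameter synchronization to accelerate training.
Evaluate scalability on real-world ads datasets and compare against a standard MPI cluster baseline.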

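Proposed method

Design a four-stage pipeline that overlaps data transfer, parameter loading, and GPU computation.
Implement a multi-GPU distributed hash table in HBM, spread across GPUs, that stores the working parameters and supports atomic updates.
Synchronize GPU parameters across nodes with all-reduce operations over RDMA.
Cluster parameters into files on SSD, using file-level parameter management and background compaction to handle stale data.
Partition parameters across GPUs and nodes with modulo hashing, which maps each key to its storage location.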
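
The modulo-hash partitioning in the last item can be sketched as follows. This is a minimal sketch under assumptions, not the paper's code: the function names `locate` and `group_by_owner` are hypothetical, feature keys are assumed to be non-negative integers, and the choice to derive the GPU index from `key // num_nodes` is ours.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def locate(key: int, num_nodes: int, gpus_per_node: int) -> Tuple[int, int]:
    """Map a feature key to its owning (node, gpu) pair with modulo hashing."""
    node = key % num_nodes
    # Use the quotient for the GPU index so it is not correlated with the
    # node index (assumption: this two-level scheme is illustrative only).
    gpu = (key // num_nodes) % gpus_per_node
    return node, gpu


def group_by_owner(
    keys: Iterable[int], num_nodes: int, gpus_per_node: int
) -> Dict[Tuple[int, int], List[int]]:
    """Deduplicate a batch's keys and bucket them by owning (node, gpu)."""
    buckets: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for key in sorted(set(keys)):
        buckets[locate(key, num_nodes, gpus_per_node)].append(key)
    return dict(buckets)
```

For example, `group_by_owner([13, 0, 13, 37], num_nodes=4, gpus_per_node=8)` places key 13 on node 1, GPU 3. Taking `key % gpus_per_node` directly would be simpler, but with 4 nodes and 8 GPUs per node it would send every key on a given node to only two of its GPUs, which is why the sketch hashes on the quotient instead.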

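Experimental results

Research questions

RQ1: Can a hierarchical GPU parameter server train terabyte-scale CTR models efficiently without sacrificing accuracy?
RQ2: What performance and cost benefits does integrating HBM-PS, MEM-PS, and SSD-PS provide over conventional MPI-based training?
RQ3: How do data transfer, caching, and I/O strategies affect overall training throughput on real-world ads data?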

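Key findings

The 4-node hierarchical GPU parameter server trains five CTR models 1.8–4.8x faster than the MPI cluster baseline.
The cost-normalized speedup over the MPI solution ranges from 4.4x to 9.0x.
The relative AUC of the hierarchical system is within 0.1% of the MPI baseline, and slightly above it for some models, indicating lossless training.
In HBM-PS, pull/push time scales with the number of nonzero features, while training time varies with the number of dense parameters.
MEM-PS and SSD-PS use caching and file-level parameter management to reduce the impact of SSD I/O, making it possible to train models that do not fit in main memory.
The experiments use 4 GPU nodes (8×32 GB HBM per node) and 5 CTR models with 8e9 to 1e11 sparse parameters, demonstrating scalability and efficiency.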
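
A quick back-of-the-envelope calculation shows why the hierarchy is needed even with 32 GPUs. The figures below come from the abstract and the findings above; the decimal unit conversions and the idea of a uniform per-feature size are simplifying assumptions, so the results are rough estimates rather than numbers reported by the paper.

```python
# Figures taken from the text above.
NODES = 4
GPUS_PER_NODE = 8
HBM_PER_GPU_GB = 32
SPARSE_FEATURES = 10**11   # largest model
MODEL_SIZE_TB = 10         # "around 10 TB" in the abstract

# Aggregate HBM across the whole 4-node cluster.
total_hbm_gb = NODES * GPUS_PER_NODE * HBM_PER_GPU_GB

# Assumption: decimal units (1 TB = 1000 GB = 10**12 bytes).
model_size_gb = MODEL_SIZE_TB * 1000
bytes_per_feature = MODEL_SIZE_TB * 10**12 / SPARSE_FEATURES

# Share of the model that could sit in HBM at any one time.
hbm_fraction = total_hbm_gb / model_size_gb

print(f"total HBM: {total_hbm_gb} GB")
print(f"approx. bytes per sparse feature: {bytes_per_feature:.0f}")
print(f"fraction of model that fits in HBM: {hbm_fraction:.1%}")
```

Under these assumptions the cluster has about 1 TB of HBM in total, roughly a tenth of the largest model, so most parameters must live in main memory or on SSD and only the working set for the current batches can stay on the GPUs.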


This review was generated by AI and reviewed by human editors.