QUICK REVIEW

[Paper Review] GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

Jeff Daily, Abhinav Vishnu|arXiv (Cornell University)|Mar 15, 2018

Molecular Communication and Nanonetworks|References: 42|Cited by: 81

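ONE-SENTENCE SUMMARY

GossipGraD introduces a gossip-based SGD that reduces per-step communication to O(1), achieving near-perfect efficiency on large GPU clusters while maintaining SGD-level accuracy.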

ABSTRACT

In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from Θ(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners for facilitating direct diffusion of gradients, 4) asynchronous distributed shuffle of samples during the feedforward phase in SGD to prevent over-fitting, 5) asynchronous communication of gradients for further reducing the communication cost of SGD and GossipGraD. We implement GossipGraD for GPU and CPU clusters and use NVIDIA GPUs (Pascal P100) connected with InfiniBand, and Intel Knights Landing (KNL) connected with Aries network. We evaluate GossipGraD using well-studied dataset ImageNet-1K (~250GB), and widely studied neural network topologies such as GoogLeNet and ResNet50 (current winner of ImageNet Large Scale Visual Recognition Challenge (ILSVRC)). Our performance evaluation using both KNL and Pascal GPUs indicates that GossipGraD can achieve perfect efficiency for these datasets and their associated neural network topologies. Specifically, for ResNet50, GossipGraD is able to achieve ~100% compute efficiency using 128 NVIDIA Pascal P100 GPUs - while matching the top-1 classification accuracy published in literature.

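RESEARCH MOTIVATION AND OBJECTIVES

Advance scalable distributed deep learning by addressing the communication bottleneck in SGD.
Design a gossip-based SGD variant that achieves constant communication complexity while preserving convergence properties.
Introduce asynchronous data shuffling and partner rotation to improve information diffusion and prevent over-fitting.
Provide a concrete GPU/CPU implementation and evaluate it on large-scale datasets.
Give a theoretical convergence argument and validate empirical performance with ResNet50 and GoogLeNet on ImageNet.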

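PROPOSED METHOD

Propose GossipGraD, which keeps communication complexity constant at O(1): each node exchanges updates with a single partner per step, and updates reach the remaining nodes indirectly within log(p) steps.
Use a layered virtual topology (hypercube or dissemination topology) to ensure that gradients diffuse within log(p) steps.
Rotate communication partners once every log(p) steps so that gradients diffuse directly to all nodes.
Apply an asynchronous distributed in-memory shuffle of samples to prevent over-fitting, overlapping the shuffle with the feedforward phase.
Implement GossipGraD on CPUs (KNL) and GPUs (Pascal P100) using non-blocking MPI calls and an optional asynchronous progress thread.
Provide a theoretical convergence argument showing convergence to local minima similar to those reached by SGD.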
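
The partner schedule described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes p is a power of two, uses the standard dissemination pattern (offsets 1, 2, 4, ...) as the virtual topology, replaces model or gradient exchange with averaging of one scalar per node, and models partner rotation as a random relabeling of virtual ranks. The names `dissemination_partners` and `gossip_steps` are hypothetical.

```python
import math
import random

def dissemination_partners(rank, step, p):
    """Return (send_to, recv_from) for one gossip step; p must be a power of two."""
    rounds = int(math.log2(p))
    offset = 1 << (step % rounds)           # 1, 2, 4, ..., p/2, then repeat
    return (rank + offset) % p, (rank - offset) % p

def gossip_steps(values, num_steps, rotate=False, seed=0):
    """Simulate gossip averaging where each node receives from one partner per step."""
    p = len(values)
    rounds = int(math.log2(p))
    assert 1 << rounds == p, "this sketch assumes p is a power of two"
    rng = random.Random(seed)
    order = list(range(p))                   # virtual rank -> physical node
    values = list(values)
    reached = [{i} for i in range(p)]        # origins whose updates influenced each node
    for step in range(num_steps):
        if rotate and step > 0 and step % rounds == 0:
            rng.shuffle(order)               # assumed rotation: relabel virtual ranks
        new_values = list(values)
        new_reached = [set(s) for s in reached]
        for vrank in range(p):
            _, src = dissemination_partners(vrank, step, p)
            me, peer = order[vrank], order[src]
            new_values[me] = 0.5 * (values[me] + values[peer])   # one message per node
            new_reached[me] = reached[me] | reached[peer]
        values, reached = new_values, new_reached
    return values, reached

if __name__ == "__main__":
    vals, reached = gossip_steps([float(i) for i in range(8)], num_steps=3)
    print(vals)                               # every node holds the global mean
    print(all(len(r) == 8 for r in reached))  # every node was reached by all others
```

With p = 8 and starting values 0 through 7, three steps (log2 of 8) leave every node holding the global mean 3.5, even though each node receives only one message per step. For p = 128 the same pattern needs log2(128) = 7 steps for an update to reach every node indirectly.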

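EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Does GossipGraD achieve constant communication complexity while maintaining SGD-like convergence?
RQ2: Do asynchronous diffusion and partner rotation improve gradient diffusion and convergence at scale?
RQ3: How does GossipGraD perform on ImageNet-scale datasets with standard architectures such as GoogLeNet and ResNet50?
RQ4: What compute efficiency is achievable when scaling GossipGraD on large GPU/CPU clusters?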

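KEY FINDINGS

GossipGraD achieves O(1) communication per step and supports overlapping computation with communication.
On 128 NVIDIA Pascal P100 GPUs, GossipGraD achieves about 100% compute efficiency for ResNet50.
GossipGraD matches the published top-1 accuracy of ResNet50 and GoogLeNet in the ImageNet experiments.
The experiments cover ImageNet-1K with GoogLeNet and ResNet50 on both Pascal GPUs and Intel KNL, and achieve complete overlap of communication with computation.
Theoretical analysis and empirical results indicate that GossipGraD converges to local minima similar to those of SGD.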
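
The overlap of communication with computation mentioned in these findings can be sketched as a loop that starts a gradient exchange in the background and keeps computing until the result is needed. This is an illustrative sketch only: a worker thread stands in for non-blocking MPI send/receive calls, the gradient is a toy quadratic, and applying the exchanged gradient one step late is an assumption made here to show the overlap, not a detail taken from the paper. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def exchange_with_partner(local_grad, partner_grad):
    """Stand-in for a non-blocking send/receive pair; returns the averaged gradient."""
    return [(a + b) / 2.0 for a, b in zip(local_grad, partner_grad)]

def compute_gradient(weights, batch):
    """Toy gradient of 0.5 * (w - x)^2, averaged over the batch per coordinate."""
    n = len(batch)
    return [sum(w - x[i] for x in batch) / n for i, w in enumerate(weights)]

def train_overlapped(weights, batches, partner_grads, lr=0.1):
    """Compute the next gradient while the previous exchange is still in flight."""
    pending = None  # future for the exchange started in the previous iteration
    with ThreadPoolExecutor(max_workers=1) as pool:
        for batch, partner_grad in zip(batches, partner_grads):
            grad = compute_gradient(weights, batch)   # overlaps with `pending`
            if pending is not None:
                mixed = pending.result()              # analogous to waiting on a request
                weights = [w - lr * g for w, g in zip(weights, mixed)]
            pending = pool.submit(exchange_with_partner, grad, partner_grad)
        if pending is not None:                       # drain the last exchange
            mixed = pending.result()
            weights = [w - lr * g for w, g in zip(weights, mixed)]
    return weights

if __name__ == "__main__":
    batches = [[[1.0]], [[1.0]]]          # two steps, one 1-D sample each
    partner_grads = [[-1.0], [-1.0]]      # gradients a partner would have sent
    print(train_overlapped([0.0], batches, partner_grads))
```

The design choice to wait only at the point where the exchanged gradient is applied is what lets the exchange hide behind the compute step; in a real MPI program the same role would be played by posting the non-blocking calls early and waiting on them late.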


This review was generated by AI and checked by a human editor.