QUICK REVIEW

[Paper Review] Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

Stefan Hadjis, Ce Zhang|arXiv (Cornell University)|Jun 14, 2016

Advanced Neural Network Applications | References: 23 | Cited by: 50

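One-Sentence Summary

Omnivore is an automatic optimizer for multi-device deep learning that trains 1.9× to 12× faster than the fastest state-of-the-art systems. It treats each device as a black box characterized by its throughput and uses a predictive model to jointly tune hyperparameters and execution strategy (including asynchronous SGD with tuned momentum) across CPUs and GPUs, minimizing end-to-end time to convergence.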

ABSTRACT

We study the factors affecting training time in multi-device deep learning systems. Given a specification of a convolutional neural network, our goal is to minimize the time to train this model on a cluster of commodity CPUs and GPUs. We first focus on the single-node setting and show that by using standard batching and data-parallel techniques, throughput can be improved by at least 5.5x over state-of-the-art systems on CPUs. This ensures an end-to-end training speed directly proportional to the throughput of a device regardless of its underlying hardware, allowing each node in the cluster to be treated as a black box. Our second contribution is a theoretical and empirical study of the tradeoffs affecting end-to-end training time in a multiple-device setting. We identify the degree of asynchronous parallelization as a key factor affecting both hardware and statistical efficiency. We see that asynchrony can be viewed as introducing a momentum term. Our results imply that tuning momentum is critical in asynchronous parallel configurations, and suggest that published results that have not been fully tuned might report suboptimal performance for some configurations. For our third contribution, we use our novel understanding of the interaction between system and optimization dynamics to provide an efficient hyperparameter optimizer. Our optimizer involves a predictive model for the total time to convergence and selects an allocation of resources to minimize that time. We demonstrate that the most popular distributed deep learning systems fall within our tradeoff space, but do not optimize within the space. By doing this optimization, our prototype runs 1.9x to 12x faster than the fastest state-of-the-art systems.

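Motivation and Goals

Address the lack of automated configuration in deep learning systems, which often leads to suboptimal performance and training times up to 10× longer.
Clarify the tradeoff in distributed training between hardware efficiency (raw throughput, e.g. FLOPS) and statistical efficiency (how quickly training converges).
Show that, when configured properly, CPU-based deep learning can be as efficient as GPU-based training and more cost-effective.
Challenge the earlier assumption that asynchrony in SGD carries a statistical penalty, by showing that the penalty disappears when algorithmic momentum is tuned correctly.
Develop an automatic optimizer that chooses the best configuration across hardware types and training strategies to minimize total training time.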

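Proposed Method

Design a prototype system, Omnivore, that treats each device (CPU or GPU) as a black box characterized only by its throughput, which makes training hardware-agnostic.
Apply standard batching and data parallelism on CPUs to raise throughput by at least 5.5× over state-of-the-art systems, making CPU training practical and efficient.
Characterize, both theoretically and empirically, how asynchrony in SGD introduces an implicit momentum term, and show that tuning the explicit momentum removes the performance loss.
Build a predictive model of time to convergence that jointly optimizes hyperparameters (such as learning rate and momentum) and execution strategy (such as synchronous versus asynchronous training).
Use the optimizer to select the configuration that minimizes total training time, and show that existing systems sit inside the same tradeoff space without optimizing within it.
Evaluate on several hardware setups (EC2 CPU and GPU instances), where Omnivore achieves a 5× speedup on a CPU cluster and a 5.6× speedup on a GPU cluster over baseline configurations.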
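
The model-then-select idea in the list above can be sketched in a few lines of Python. This is a minimal illustration, not Omnivore's actual predictive model: it assumes total training time is simply seconds per iteration (hardware efficiency) multiplied by iterations to converge (statistical efficiency), and the `Config` class, the strategy names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate execution strategy (all values are made up)."""
    name: str
    seconds_per_iteration: float  # hardware efficiency: lower is better
    iterations_to_converge: int   # statistical efficiency: lower is better


def predicted_time(cfg: Config) -> float:
    """Predicted end-to-end training time in seconds (assumed product model)."""
    return cfg.seconds_per_iteration * cfg.iterations_to_converge


def pick_configuration(candidates):
    """Return the candidate with the smallest predicted training time."""
    return min(candidates, key=predicted_time)


# Hypothetical candidates: more asynchrony makes each iteration cheaper
# but, in this made-up data, needs more iterations to converge.
candidates = [
    Config("synchronous", 0.8, 10_000),
    Config("partly asynchronous", 0.25, 14_000),
    Config("fully asynchronous", 0.125, 40_000),
]

best = pick_configuration(candidates)
print(best.name, predicted_time(best))
```

With these invented numbers the intermediate strategy wins, which illustrates why neither extreme of the tradeoff space is guaranteed to be optimal.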

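Experimental Results

Research Questions

RQ1: With proper batching and a throughput-aware design, can CPU-based deep learning match the performance of GPU-based training?
RQ2: How does asynchrony in distributed SGD relate to algorithmic momentum, and does tuning momentum remove the statistical cost of asynchrony?
RQ3: How do hardware efficiency (FLOPS) and statistical efficiency (convergence speed) trade off in multi-device deep learning systems?
RQ4: Why do existing deep learning systems underperform even though they sit in the same tradeoff space, and which configuration choices cause the suboptimal performance?
RQ5: Can an automatic optimizer that jointly tunes hyperparameters and execution strategy substantially reduce end-to-end training time across diverse hardware platforms?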

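Key Findings

By applying standard batching and data parallelism, Omnivore achieves at least 5.5× higher throughput on CPUs than state-of-the-art systems, bringing optimized CPU training speed in line with GPU training.
When algorithmic momentum is tuned, asynchrony in SGD carries no statistical penalty. This resolves a long-running debate and explains why earlier studies reported worse results for asynchronous training.
The configuration that minimizes training time lies inside the tradeoff space between hardware efficiency and statistical efficiency, and existing systems do not optimize within that space.
Compared with the fastest existing systems, Omnivore reduces end-to-end time to convergence by 1.9× to 12×, with the largest gains in configurations where earlier systems diverge or underperform.
On cloud platforms such as EC2, CPU-based training with Omnivore is 2.1× cheaper than GPU-based training thanks to a better FLOPS-per-dollar ratio, making CPUs the more cost-effective option.
Comparing GPU and CPU clusters at their optimal configurations, the GPU cluster is 5.6× faster in raw speed (ignoring statistical inefficiency) and 1.8× cheaper per iteration because of its higher FLOPS-per-dollar efficiency.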
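
The interaction between asynchrony and momentum can be explored with a toy experiment. The sketch below is an assumption-laden stand-in, not the paper's setup: it models asynchrony as a fixed gradient delay (`staleness`) on the one-dimensional objective f(w) = 0.5·w², and the function name, learning rates, and momentum grid are all hypothetical. It only reports how many iterations each explicit momentum value needs, so the effect of tuning can be inspected directly.

```python
def iterations_to_converge(lr, momentum, staleness, tol=1e-6, max_iters=10_000):
    """Run momentum SGD on f(w) = 0.5 * w**2 using gradients that are
    `staleness` steps old (a crude, assumed model of asynchrony).

    Returns the number of iterations until both |w| and |velocity| fall
    below `tol`, or None if the run diverges or hits `max_iters`.
    """
    # history[-1] is the current w; older entries are the stale copies.
    history = [1.0] * (staleness + 1)
    velocity = 0.0
    for step in range(1, max_iters + 1):
        grad = history[-(staleness + 1)]  # gradient of 0.5*w^2 at the stale w
        velocity = momentum * velocity - lr * grad
        w = history[-1] + velocity
        history.append(w)
        if abs(w) > 1e12:  # treat runaway values as divergence
            return None
        if abs(w) < tol and abs(velocity) < tol:
            return step
    return None


# Reference run: no delay, no momentum.
baseline = iterations_to_converge(lr=0.5, momentum=0.0, staleness=0)

# Sweep explicit momentum under a fixed delay of 4 steps (hypothetical values).
sweep = {
    mu: iterations_to_converge(lr=0.1, momentum=mu, staleness=4)
    for mu in (0.0, 0.3, 0.6, 0.9)
}

print("baseline iterations:", baseline)
for mu, iters in sweep.items():
    print(f"momentum={mu}: iterations={iters}")
```

A fixed delay is the simplest possible staleness model; a real asynchronous system has variable delays and stochastic gradients, so the numbers this prints say nothing about absolute training times.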

This review was generated by AI and checked by a human editor.