QUICK REVIEW

[Paper Review] Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Myeongjae Jeon, Shivaram Venkataraman|arXiv (Cornell University)|Jan 17, 2019

Cloud Computing and Resource Management | Cited by 118

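One-Sentence Summary

This paper analyzes a large multi-tenant GPU cluster at Microsoft to understand how locality, scheduling, and failures affect DNN training, and offers design guidelines for next-generation schedulers.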

ABSTRACT

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.

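Motivation and Objectives

Characterize how gang scheduling and locality constraints affect queuing and utilization in a large multi-tenant GPU cluster used for DNN training.
Evaluate the impact of GPU locality, server-level interference, and colocation on GPU utilization and training efficiency.
Identify common failure modes in DNN training workloads and their implications for cluster utilization and retry policies.
Provide design guidelines for next-generation cluster schedulers that improve locality, isolation, and early failure detection for DNN workloads.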
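To make the interaction between gang scheduling and locality concrete, here is a minimal sketch in Python. It is not Philly's placement logic: the function name `gang_place`, the `max_servers` locality knob, and the greedy "most free GPUs first" ordering are all assumptions chosen for illustration. The point it shows is that a job either receives all of its GPUs at once or none, so a cluster can hold enough free GPUs in total and still be unable to start a job under a strict locality limit.

```python
def gang_place(free_gpus, gpus_needed, max_servers):
    """All-or-nothing placement of one job (hypothetical helper).

    free_gpus:   dict mapping server name -> number of idle GPUs
    gpus_needed: total GPUs the job must receive at the same time
    max_servers: locality limit; the job may span at most this many servers
    Returns a dict server -> GPUs taken, or None if the gang cannot be placed.
    """
    # Prefer servers with the most idle GPUs so the job spans few servers.
    ranked = sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0]))
    placement = {}
    remaining = gpus_needed
    for server, free in ranked[:max_servers]:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[server] = take
            remaining -= take
    # Gang scheduling: a partial allocation is useless, so report failure.
    return placement if remaining == 0 else None


# Eight idle GPUs exist, but they are fragmented across three servers.
free = {"s1": 3, "s2": 3, "s3": 2}
strict = gang_place(free, 8, max_servers=2)   # locality limit too tight
relaxed = gang_place(free, 8, max_servers=3)  # relaxed locality
```

In this toy cluster the strict call cannot place the job while the relaxed call can, which mirrors the trade-off the paper studies: relaxing locality shortens queuing but spreads the job over more servers.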

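Proposed Approach

Analyze a two-month trace from Microsoft's multi-tenant GPU cluster (Philly), covering roughly 100,000 jobs and 14 virtual clusters.
Correlate scheduler logs (YARN) with per-job logs and Ganglia utilization data to study locality, queuing, and failures.
Decompose queuing delay into fair-share and fragmentation components, and study how each depends on the number of GPUs requested.
Evaluate GPU and host resource utilization under different placement scenarios (same server, different servers, intra-server versus inter-server).
Compare Philly with other schedulers and provide practical design guidelines for locality-aware scheduling of DNN workloads.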
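The delay decomposition in the list above can be sketched as follows. This is a simplified illustration rather than the paper's analysis code: the `WaitInterval` record, its field names, and the rule "waiting counts as fair-share delay whenever the request would push the virtual cluster past its quota" are assumptions. Any waiting time that is not explained by quota is attributed to fragmentation, that is, the job has quota available but no placement satisfies its locality constraints.

```python
from dataclasses import dataclass


@dataclass
class WaitInterval:
    """One stretch of time a job spent queued (hypothetical record)."""
    seconds: float
    vc_gpus_in_use: int  # GPUs used by the job's virtual cluster in this interval
    vc_quota: int        # GPU quota assigned to that virtual cluster


def split_queueing_delay(intervals, gpus_requested):
    """Return (fair_share_seconds, fragmentation_seconds) for one job."""
    fair_share = 0.0
    fragmentation = 0.0
    for iv in intervals:
        if iv.vc_gpus_in_use + gpus_requested > iv.vc_quota:
            # The virtual cluster has no quota left for this job.
            fair_share += iv.seconds
        else:
            # Quota is available, so the wait is blamed on fragmentation.
            fragmentation += iv.seconds
    return fair_share, fragmentation


# A 4-GPU job waits 60 s while its virtual cluster is at quota,
# then 30 s more even though quota has freed up.
waits = [WaitInterval(60.0, 8, 8), WaitInterval(30.0, 2, 8)]
delays = split_queueing_delay(waits, gpus_requested=4)
```

Aggregating these two numbers per job-size bucket is one way to examine how each component depends on the number of GPUs requested.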

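Experimental Results

Research Questions

RQ1: How do locality constraints and gang scheduling affect queuing delay for DNN training jobs?
RQ2: How does locality-aware scheduling affect GPU utilization and training performance for distributed multi-GPU jobs?
RQ3: What are the main causes of job failure in a large multi-tenant DNN training cluster, and how do they affect utilization?
RQ4: Which scheduler design choices can mitigate fragmentation, interference, and failures to improve utilization and performance?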

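Key Findings

Queuing delay depends on locality: relaxing locality constraints reduces delay, especially for larger jobs (more than 4 GPUs).
Average GPU hardware utilization on in-use GPUs is about 52%, and it drops further as job size grows because of synchronization and interference.
Fragmentation delay dominates the waiting time of many jobs, especially those requesting 5-8 GPUs or more; fair-share delay occurs when a virtual cluster's quota is exhausted.
About 30% of jobs fail or are killed, yet they consume a large amount of GPU time, which highlights the inefficiency caused by failures.
Distributed training across multiple servers lowers GPU utilization because of RDMA/PCIe contention and inter-server communication overhead; colocating jobs on the same server lowers utilization further.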
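Most passed (successfully completed) jobs need nearly all of their epochs to reach their best loss, yet the final epochs typically bring only marginal improvement, which suggests an opportunity to stop early and save GPU time.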
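The early-termination finding can be explored with a small calculation over a job's per-epoch loss log. The sketch below is an assumption-laden illustration, not the paper's methodology: the function name `fraction_of_epochs_to_near_best` and the relative tolerance `rel_tol` (defaulting here to 0.1%) are hypothetical choices. It reports what fraction of the executed epochs a job needed before its loss came within the tolerance of the best loss it ever reached.

```python
def fraction_of_epochs_to_near_best(losses, rel_tol=0.001):
    """Fraction of epochs run before loss is within rel_tol of the best loss.

    losses:  per-epoch training loss values, in execution order
    rel_tol: relative tolerance around the best loss (assumed parameter)
    """
    if not losses:
        raise ValueError("losses must contain at least one epoch")
    best = min(losses)
    threshold = best + abs(best) * rel_tol
    for epoch_index, loss in enumerate(losses):
        if loss <= threshold:
            # Epochs are 1-based when counted, hence the +1.
            return (epoch_index + 1) / len(losses)
    # Unreachable in practice: the best epoch always meets the threshold.
    return 1.0


# Toy loss curve: the third of four epochs is already within 0.1% of the best.
curve = [4.0, 2.0, 1.0005, 1.0]
near_best = fraction_of_epochs_to_near_best(curve)             # tolerance 0.1%
exact_best = fraction_of_epochs_to_near_best(curve, rel_tol=0.0)
```

With a zero tolerance the toy job needs every epoch, while a small tolerance shows it could have stopped earlier. A scheduler could use a gap like this, measured over real jobs, to estimate how much GPU time early stopping might reclaim.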


This review was generated by AI and checked by a human editor.