QUICK REVIEW

[Paper Review] Scheduling Splittable Jobs on Configurable Machines

Cheng Tan, Zhichao Li | arXiv (Cornell University) | Sep 18, 2021

Scheduling and Optimization Algorithms | References: 31 | Cited by: 14

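One-Sentence Summary

This paper presents MIG-serving, a system that uses Multi-Instance GPU (MIG) partitioning to optimize deep neural network (DNN) inference on NVIDIA A100 GPUs. The system combines a heuristic, a genetic algorithm, and Monte Carlo Tree Search (MCTS) to produce low-cost GPU configurations that satisfy service-level objectives (SLOs), and it reduces GPU usage by up to 40% compared with using each A100 as a single whole device, without sacrificing performance.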

ABSTRACT

Motivated by modern architectures allowing for the partitioning of a GPU into hardware separated instances, we initiate the study of scheduling splittable jobs on configurable machines. We consider machines that can be configured into smaller instances, which we call blocks, in multiple ways, each of which is referred to as a configuration. We introduce the Configurable Machine Scheduling (CMS) problem, where we are given n jobs and a set C of configurations. A schedule consists of a set of machines, each assigned some configuration in C with each block in the configuration assigned to process one job. The amount of a job’s demand that is satisfied by a block is given by an arbitrary function of the job and block. The objective is to construct a schedule using as few machines as possible. We provide a tight logarithmic factor approximation algorithm for this problem in the general setting, a factor (3 + ε) approximation algorithm for arbitrary ε > 0 when there are O(1) input configurations, and a polynomial time approximation scheme when both the number and size of configurations are O(1). Finally, we utilize a technique for finding conic integer combinations in fixed dimension to develop an optimal polynomial time algorithm in the case with O(1) jobs, O(1) blocks, and every configuration up to a given size.
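To make the problem statement in the abstract concrete, the following is a minimal sketch of how a CMS schedule could be represented and checked. It is not taken from the paper: the names `satisfied_demand`, `is_feasible`, `CONFIGS`, `RATE`, and `DEMAND`, the block labels, and all numbers are hypothetical, and allowing an idle block (`None`) is an assumption made for illustration.

```python
from collections import Counter

def satisfied_demand(schedule, f):
    """Sum f(job, block) over every occupied block of every machine."""
    served = Counter()
    for config, assignment in schedule:
        if len(config) != len(assignment):
            raise ValueError("one job slot per block is required")
        for block, job in zip(config, assignment):
            if job is not None:  # None marks an idle block (assumption)
                served[job] += f(job, block)
    return served

def is_feasible(schedule, configs, demand, f):
    """True if every machine uses an allowed configuration and all demands are met."""
    if any(config not in configs for config, _ in schedule):
        return False
    served = satisfied_demand(schedule, f)
    return all(served[job] >= need for job, need in demand.items())

# Hypothetical instance: two configurations, two jobs.
CONFIGS = {("big",), ("small", "small")}
RATE = {("a", "big"): 10, ("a", "small"): 4,
        ("b", "big"): 5, ("b", "small"): 4}
DEMAND = {"a": 10, "b": 4}

def f(job, block):
    # Arbitrary function of job and block, here a lookup table.
    return RATE[(job, block)]

# Each machine is (configuration, job assigned to each block).
schedule = [
    (("big",), ("a",)),
    (("small", "small"), ("b", None)),
]

# The objective is the number of machines, len(schedule).
print(is_feasible(schedule, CONFIGS, DEMAND, f), len(schedule))
```

The checker only verifies a given schedule; finding a schedule with the fewest machines is the hard part that the approximation algorithms in the abstract address.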

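Motivation and Objectives

Address the challenge of efficiently scheduling DNN inference workloads on A100 GPUs under MIG restrictions, where hardware constraints make partitioning decisions non-trivial.
Define and formalize the Reconfigurable Machine Scheduling (RMS) problem, which captures non-linear performance scaling, restricted partitioning rules, and partial reconfigurability.
Design a system that minimizes the number of GPUs needed to meet the SLOs (throughput and latency) of multiple DNN models running concurrently.
Ensure that transitions between deployments are smooth and transparent during configuration updates, so that service is not interrupted.
Evaluate the system's performance and efficiency on a real Kubernetes-based cluster, and compare it against a baseline that uses whole A100 GPUs.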

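Proposed Approach

A two-phase optimization pipeline: the first phase uses a fast greedy heuristic to produce an initial deployment, and the second phase uses a slow, iterative genetic algorithm (GA) to refine and improve the solution.
Monte Carlo Tree Search (MCTS) serves as a high-accuracy search component that explores the complex configuration space in search of optimal GPU partitions.
The genetic algorithm is a custom implementation that combines parent solutions through crossover and mutation, with evolution guided by a fitness function based on SLO satisfaction and GPU utilization.
A novel "exchange-and-compact" algorithm in the controller module enables transparent, interruption-free transitions between deployments.
The system is implemented on Kubernetes to manage real-time scheduling and orchestration of MIG instances in the cluster.
A DNN performance model for each instance size is built from measured benchmarks of 49 models from PyTorch and TensorFlow Hub, capturing non-linear throughput scaling.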
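The fast first phase described above can be illustrated with a small sketch. This is not the paper's heuristic: it is a generic greedy stand-in that picks, per model, the instance size with the best throughput per slice and then packs instances with first-fit decreasing. The function names, the `PERF` table, and the `SLO` targets are hypothetical, and the sketch assumes any set of instances totalling at most 7 slices fits on one GPU, which ignores the real MIG placement rules and latency SLOs.

```python
import math

SLICES_PER_GPU = 7  # an A100 exposes up to 7 compute slices under MIG

def pick_instances(model, throughput_by_size, required):
    """Choose the size with the best throughput per slice, then replicate it."""
    best = max(throughput_by_size, key=lambda s: throughput_by_size[s] / s)
    count = math.ceil(required / throughput_by_size[best])
    return [(best, model)] * count

def pack_first_fit(instances):
    """First-fit decreasing: place each instance on the first GPU with room."""
    gpus = []
    for size, model in sorted(instances, key=lambda inst: inst[0], reverse=True):
        for gpu in gpus:
            if sum(s for s, _ in gpu) + size <= SLICES_PER_GPU:
                gpu.append((size, model))
                break
        else:
            gpus.append([(size, model)])  # no GPU had room, open a new one
    return gpus

def greedy_deployment(perf, slo):
    """Build instances for every model, then pack them onto GPUs."""
    instances = []
    for model, required in slo.items():
        instances.extend(pick_instances(model, perf[model], required))
    return pack_first_fit(instances)

# Hypothetical performance model: throughput (req/s) per instance size in slices.
PERF = {
    "model_a": {1: 100, 2: 180, 7: 500},
    "model_b": {1: 40, 3: 150, 7: 300},
}
# Hypothetical throughput SLOs (req/s).
SLO = {"model_a": 350, "model_b": 290}

deployment = greedy_deployment(PERF, SLO)
for index, gpu in enumerate(deployment):
    print(index, gpu)
```

A slower refinement phase such as the genetic algorithm would start from a deployment like this one and search for variants that use fewer GPUs.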

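Experimental Results

Research Questions

RQ1: How can heterogeneous DNN workloads be scheduled efficiently on MIG-enabled A100 GPUs so that GPU usage is minimized while SLOs are met?
RQ2: Which key constraints and characteristics of MIG partitioning make traditional scheduling algorithms unsuitable?
RQ3: Can a hybrid pipeline combining a heuristic, a genetic algorithm, and MCTS outperform baseline methods in cost efficiency and configuration quality?
RQ4: How can deployment transitions be made transparent to end users during runtime reconfiguration?
RQ5: By how much can MIG-serving reduce the GPU footprint compared with using each A100 as a single whole device?

Key Findings

Compared with using each A100 as a whole (A100-7/7), MIG-serving reduces GPU usage by up to 40% and achieves the highest cost efficiency across all evaluated models.
The system met all SLO requirements when deploying 49 diverse DNN models (including ResNet-50, BERT-base, and BERT-large) across MIG partitions.
A deployment transition between two real workloads completed within 30 minutes, with no service interruption observed.
DNN model performance on MIG instances does not scale linearly with allocated resources, which confirms the need for instance-size-aware scheduling.
The Reconfigurable Machine Scheduling (RMS) problem is NP-hard and, because of restricted partitioning rules and non-linear performance curves, cannot be solved with traditional resource-allocation heuristics.
Combining the fast heuristic with the slow, MCTS-based genetic algorithm significantly improves deployment quality over time, and the slow algorithm can reach near-optimal configurations.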
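The link between non-linear scaling and GPU savings can be seen in a short worked calculation. The throughput figures and the demand below are invented for illustration and are not measurements from the paper; the only fact assumed about the hardware is that an A100 can be split into seven 1/7 instances.

```python
import math

def gpus_needed(demand, throughput_per_gpu):
    """Smallest number of GPUs whose combined throughput covers the demand."""
    return math.ceil(demand / throughput_per_gpu)

# Hypothetical model with sublinear scaling: a 7/7 instance is only 4x
# faster than a 1/7 instance, not 7x.
THROUGHPUT_1_OF_7 = 100   # req/s on one 1/7 instance (made-up number)
THROUGHPUT_7_OF_7 = 400   # req/s on one whole GPU (made-up number)
DEMAND = 1200             # req/s required by the SLO (made-up number)

whole = gpus_needed(DEMAND, THROUGHPUT_7_OF_7)        # one 7/7 instance per GPU
split = gpus_needed(DEMAND, 7 * THROUGHPUT_1_OF_7)    # seven 1/7 instances per GPU
saving = 1 - split / whole

print(whole, split, round(saving, 2))
```

With these numbers the partitioned deployment needs 2 GPUs instead of 3. A model that scaled linearly (700 req/s on a whole GPU) would gain nothing from partitioning, which is why a per-size performance model matters.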


This review was generated by AI and reviewed by human editors.