QUICK REVIEW

[Paper Review] Serving DNNs like Clockwork: Performance Predictability from the Bottom Up

Arpan Gujarati, Reza Karimi|arXiv (Cornell University)|Jun 3, 2020

Context-Aware Activity Recognition Systems|References: 37|Cited by: 30

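One-Sentence Summary

Clockwork shows that DNN inference executes deterministically on GPUs, and it pairs a centralized controller with per-GPU workers to deliver predictable end-to-end latency and meet tight SLOs across thousands of models. It introduces a predictable, bottom-up architecture that minimizes tail latency by constraining where variability can arise.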

ABSTRACT

Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications depend must consistently meet low latency targets. Existing model serving architectures use well-known reactive techniques to alleviate common-case sources of latency, but cannot effectively curtail tail latency caused by unpredictable execution times. Yet the underlying execution times are not fundamentally unpredictable - on the contrary we observe that inference using Deep Neural Network (DNN) models has deterministic performance. Here, starting with the predictable execution times of individual DNN inferences, we adopt a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance. We evaluate our implementation, Clockwork, using production trace workloads, and show that Clockwork can support thousands of models while simultaneously meeting 100ms latency targets for 99.9999% of requests. We further demonstrate that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.

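Motivation and Objectives

Demonstrate that DNN inference on GPUs has deterministic execution times.
Propose a design principle, consolidating choice, for preserving predictability in a distributed serving system.
Present Clockwork's architecture of a centralized controller and predictable workers.
Evaluate Clockwork on near-production workloads and compare it with existing systems in terms of latency and throughput.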

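Proposed Approach

Demonstrate and quantify the determinism of individual DNN inferences on GPUs.
Consolidate choice in the upper layers to constrain variability in the lower layers (restricting scheduling and memory-management decisions).
Implement Clockwork as a centralized controller plus per-GPU workers, each executing one Load and one Infer operation at a time.
Use a model runtime built on TVM that compiles and runs user models with pre-allocated memory and static workspaces.
Enforce a strict operation-based interface (Load, Unload, Infer) with earliest/latest execution windows and no best-effort recovery.
Evaluate with production trace workloads, measuring latency targets and scalability in the number of models.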
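The windowed interface described above can be illustrated with a minimal sketch. It is not Clockwork's implementation: the `Action` and `Worker` names, the `duration` field, and the virtual clock are all hypothetical, and for simplicity the sketch runs every operation through a single serial queue rather than handling Load and Infer separately. It only shows the core idea that an operation either starts inside its earliest/latest window or is rejected outright, with no best-effort retry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical operation sent by the controller to a worker."""
    kind: str        # "Load", "Unload", or "Infer"
    model_id: str
    earliest: float  # earliest permitted start time (ms)
    latest: float    # latest permitted start time (ms)
    duration: float  # predicted execution time (ms), assumed known in advance


class Worker:
    """Runs actions one at a time against a virtual clock (illustrative only)."""

    def __init__(self):
        self.now = 0.0   # virtual time at which the worker is next free
        self.log = []    # (kind, model_id, outcome) tuples

    def submit(self, action):
        # The action cannot start before its window opens or before the worker is free.
        start = max(self.now, action.earliest)
        if start > action.latest:
            # Window missed: reject instead of running late.
            self.log.append((action.kind, action.model_id, "rejected"))
            return False
        self.now = start + action.duration
        self.log.append((action.kind, action.model_id, "done"))
        return True


worker = Worker()
results = [
    worker.submit(Action("Load", "resnet50", earliest=0, latest=5, duration=8)),
    worker.submit(Action("Infer", "resnet50", earliest=8, latest=10, duration=3)),
    # The worker is busy until t=11, so a window that closes at t=9 is missed.
    worker.submit(Action("Infer", "resnet50", earliest=0, latest=9, duration=3)),
]
print(results)
```

Because a rejected operation is reported rather than silently delayed, the scheduling decision stays with whoever issued it, which is the point of consolidating choice in the controller.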

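Experimental Results

Research Questions

RQ1: In a distributed serving environment, can DNN inference be treated as predictable, deterministic execution?
RQ2: As workloads scale across many models, how can system design choices be consolidated to preserve predictable performance?
RQ3: Does Clockwork achieve tight end-to-end latency SLOs and strong request-level isolation while supporting thousands of models per GPU?
RQ4: What is the trade-off between predictability and modularity when scheduling decisions are centralized?
RQ5: How does Clockwork compare with prior model serving systems in latency, goodput, and model sharing?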

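Key Findings

The latency of an individual inference is highly predictable (on a V100 GPU, the gap between the 99.99th percentile and the median is small).
Having the centralized controller run one Infer operation at a time markedly reduces variability introduced by the GPU, the operating system, and other layers, which improves tail latency.
Under realistic workloads, Clockwork supports thousands of models per GPU and achieves sub-100 ms latency for 99.9999% of requests.
Consolidating choice reduces the need for best-effort mechanisms, making execution predictable and schedulable ahead of time.
Compared with Clipper and INFaaS, Clockwork meets latency targets more reliably while achieving similar or better goodput and resource sharing.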
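The predictability finding above compares a high percentile of inference latency with the median. As a rough illustration of how such a gap could be computed from measured samples, here is a minimal sketch; the function names and the sample values are hypothetical and are not taken from the paper, and the nearest-rank percentile definition is one common choice among several.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_spread(samples, p=99.99):
    """Relative gap between the p-th percentile and the median."""
    median = percentile(samples, 50)
    return (percentile(samples, p) - median) / median


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [10.0] * 9 + [12.0]
spread = tail_spread(latencies_ms, p=100)
print(f"max is {spread:.0%} above the median")
```

A spread close to zero at a high percentile is what would justify treating the median as a usable prediction of execution time when scheduling ahead.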

This review was generated by AI and checked by human editors.