QUICK REVIEW

[Paper Review] DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Udit Gupta, Samuel Hsia|arXiv (Cornell University)|Jan 8, 2020

Stochastic Gradient Optimization Techniques|References: 55|Cited by: 38

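One-Sentence Summary

The paper presents DeepRecInfra, an end-to-end infrastructure for modeling at-scale neural recommendation inference, and DeepRecSched, a hill-climbing scheduler that tunes the per-request batch size and GPU offloading to maximize throughput while meeting tail-latency targets.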

ABSTRACT

Neural personalized recommendation is the corner-stone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in at-scale production datacenter shows over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

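Motivation and Objectives

Establish why at-scale neural recommendation inference needs optimizing in large datacenters: it accounts for significant compute demand, so efficiency gains translate directly into capacity savings.
Propose an end-to-end infrastructure (DeepRecInfra) that captures industry-representative models, workloads, and tail-latency targets.
Develop a dynamic scheduler (DeepRecSched) that co-designs request-level and batch-level parallelism together with offloading to hardware accelerators.
Demonstrate throughput gains and latency reductions across multiple models and hardware configurations in a production-scale setting.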

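Proposed Method

Characterize eight state-of-the-art recommendation models to capture model-level heterogeneity and bottlenecks.
Build DeepRecInfra to model industry workloads, query arrival patterns (Poisson distribution), and query-size distributions taken from production datacenters.
Introduce DeepRecSched, a hill-climbing scheduler that tunes the per-request batch size and the GPU offloading threshold to maximize QPS under a tail-latency target.
Evaluate within DeepRecInfra on Broadwell and Skylake CPUs and on a GPU (GTX-1080Ti), comparing against a static baseline.
Analyze how hardware heterogeneity (CPU SIMD width, cache hierarchy) shifts the optimal balance between request-level and batch-level parallelism.
Show that DeepRecSched delivers significantly higher throughput than the baseline and improves power efficiency.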
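The batch-size search described above can be illustrated with a small sketch. This is not the paper's implementation: the doubling step, the function names (`toy_latency_bounded_qps`, `hill_climb_batch_size`), and every constant in the cost model are assumptions chosen only to make the example self-contained. The toy model also uses a deterministic per-query latency in place of a measured tail latency.

```python
import math
from typing import Callable


def toy_latency_bounded_qps(
    batch_size: int,
    target_ms: float,
    query_size: int = 64,      # items ranked per query (assumed)
    cores: int = 16,           # CPU cores serving requests (assumed)
    overhead_ms: float = 0.5,  # fixed cost per request (assumed)
    per_item_ms: float = 0.1,  # compute cost per item (assumed)
) -> float:
    """Toy model: queries/second at this batch size, or 0.0 if the target is missed."""
    # A query is split into requests of `batch_size` items each.
    requests = math.ceil(query_size / batch_size)
    request_ms = overhead_ms + per_item_ms * batch_size
    # Requests of one query run in parallel, `cores` at a time.
    waves = math.ceil(requests / cores)
    query_ms = waves * request_ms
    if query_ms > target_ms:
        return 0.0  # latency target violated
    # Each query consumes requests * request_ms core-milliseconds.
    return 1000.0 * cores / (requests * request_ms)


def hill_climb_batch_size(
    measure_qps: Callable[[int], float],
    start: int = 1,
    max_batch: int = 1024,
) -> int:
    """Double the batch size while the latency-bounded QPS keeps improving."""
    best_batch = start
    best_qps = measure_qps(best_batch)
    candidate = best_batch * 2
    while candidate <= max_batch:
        qps = measure_qps(candidate)
        if qps <= best_qps:
            break  # throughput stopped improving: keep the previous point
        best_batch, best_qps = candidate, qps
        candidate *= 2
    return best_batch


if __name__ == "__main__":
    for target in (5.0, 2.5):
        chosen = hill_climb_batch_size(
            lambda b: toy_latency_bounded_qps(b, target_ms=target)
        )
        print(f"target={target} ms -> batch size {chosen}")
```

With these made-up constants the search settles on a batch size of 32 for a 5 ms target and 16 for a 2.5 ms target, which mirrors the qualitative point that a tighter latency target favors smaller batches and more request-level parallelism. In a real system `measure_qps` would be replaced by a measurement under load.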

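Experimental Results

Research Questions

RQ1: How do model architectures, memory accesses, and input features in at-scale recommendation inference differ from those of other deep neural network workloads?
RQ2: Can the end-to-end infrastructure (DeepRecInfra) faithfully model real-world production recommendation workloads for at-scale inference?
RQ3: Can the hill-climbing scheduler (DeepRecSched) maximize throughput under tail-latency constraints by tuning batch size and accelerator offloading across diverse models and hardware?
RQ4: Under realistic query distributions, what throughput and energy-efficiency gains do GPUs and hybrid CPU+GPU architectures provide for at-scale recommendation inference?
RQ5: How does hardware heterogeneity affect the optimal split of work between request-level and batch-level parallelism?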

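Key Findings

DeepRecInfra models eight industry-representative models, realistic tail-latency targets, and production-like query patterns.
DeepRecSched doubles system throughput under strict latency targets and outperforms the static scheduler on all eight models.
In the CPU and GPU evaluation, DeepRecSched-CPU yields a 1.7–2.7× throughput gain and DeepRecSched-GPU yields 4.0–5.8×, depending on the latency target and the model.
GPU acceleration benefits larger queries most, and there is an optimal GPU offloading threshold that varies with the model and the tail-latency target.
Offloading to the GPU raises throughput but incurs data-transfer overhead; the optimal threshold balances the speedup from acceleration against the transfer cost.
The optimal batch size and offloading threshold depend on the model architecture, the tail-latency target, and the hardware platform.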
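The trade-off between GPU speedup and transfer cost can be made concrete with a per-query break-even sketch. It is a simplification under stated assumptions: the linear cost models, all constants, and the helper names (`cpu_latency_ms`, `gpu_latency_ms`, `smallest_size_worth_offloading`, `route`) are hypothetical, and the paper tunes the threshold against system-level throughput rather than this single-query comparison.

```python
from typing import Optional


def cpu_latency_ms(query_size: int, cpu_ms_per_item: float = 0.10) -> float:
    """Assumed CPU cost: purely proportional to the number of items."""
    return cpu_ms_per_item * query_size


def gpu_latency_ms(
    query_size: int,
    transfer_fixed_ms: float = 2.0,      # fixed data-transfer overhead (assumed)
    transfer_ms_per_item: float = 0.01,  # per-item transfer cost (assumed)
    gpu_ms_per_item: float = 0.02,       # per-item GPU compute cost (assumed)
) -> float:
    """Assumed GPU cost: fixed transfer overhead plus per-item transfer and compute."""
    return transfer_fixed_ms + (transfer_ms_per_item + gpu_ms_per_item) * query_size


def smallest_size_worth_offloading(max_size: int = 1024) -> Optional[int]:
    """Return the smallest query size at which the GPU path beats the CPU path."""
    for size in range(1, max_size + 1):
        if gpu_latency_ms(size) < cpu_latency_ms(size):
            return size
    return None  # the GPU never wins within the searched range


def route(query_size: int, threshold: int) -> str:
    """Send queries at or above the threshold to the GPU, the rest to the CPU."""
    return "gpu" if query_size >= threshold else "cpu"


if __name__ == "__main__":
    threshold = smallest_size_worth_offloading()
    print("offload threshold:", threshold)
    print(route(10, threshold), route(100, threshold))
```

Under these assumed constants the break-even point is a query size of 29: smaller queries stay on the CPU because the fixed transfer cost dominates, while larger ones amortize it. A different model or latency target would change the constants and therefore the threshold.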


This review was generated by AI and checked by a human editor.