QUICK REVIEW

[Paper Review] Portable Acceleration of CMS Computing Workflows with Coprocessors as a Service

Hayrapetyan, Aram, Tumasyan, Armen|arXiv (Cornell University)|Jan 1, 2024

Scientific Computing and Data Management|Cited by 1

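One-Sentence Summary

The paper proposes a lightweight as-a-service framework based on Services for Optimized Network Inference on Coprocessors (SONIC) that accelerates machine learning inference in CMS computing workflows by offloading it to remote or local GPUs. In the Mini-AOD production workflow, the approach achieves up to a 3.5x throughput improvement with very low communication overhead, reaches high coprocessor utilization, and supports architectural portability across CPUs and multiple coprocessor types.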

ABSTRACT

Computing demands for large scientific experiments, such as the CMS experiment at the CERN LHC, will increase dramatically in the next decades. To complement the future performance increases of software running on central processing units (CPUs), explorations of coprocessor usage in data processing hold great potential and interest. Coprocessors are a class of computer processors that supplement CPUs, often improving the execution of certain functions due to architectural design choices. We explore the approach of Services for Optimized Network Inference on Coprocessors (SONIC) and study the deployment of this as-a-service approach in large-scale data processing. In the studies, we take a data processing workflow of the CMS experiment and run the main workflow on CPUs, while offloading several machine learning (ML) inference tasks onto either remote or local coprocessors, specifically graphics processing units (GPUs). With experiments performed at Google Cloud, the Purdue Tier-2 computing center, and combinations of the two, we demonstrate the acceleration of these ML algorithms individually on coprocessors and the corresponding throughput improvement for the entire workflow. This approach can be easily generalized to different types of coprocessors and deployed on local CPUs without decreasing the throughput performance. We emphasize that the SONIC approach enables high coprocessor usage and enables the portability to run workflows on different types of coprocessors.

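Motivation and Objectives

Address the growing computing demands of high-energy physics experiments such as CMS, where machine learning inference accounts for about 10% of processing time in key workflows.
Overcome the low utilization and poor scalability of direct coprocessor-to-CPU coupling by decoupling the computation from the client.
Enable efficient, scalable, and portable deployment of machine learning inference on heterogeneous coprocessors (such as GPUs and FPGAs) through a standardized as-a-service model.
Optimize GPU utilization in large-scale data processing by dynamically distributing inference workloads to remote or local coprocessor servers.
Demonstrate that the SONIC framework enables algorithm portability across hardware platforms while maintaining high performance and low latency.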
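
To put the roughly 10% inference share from the first objective in context, Amdahl's law relates the speedup of one part of a workflow to the speedup of the whole. The sketch below is only an illustration: the function names and the per-algorithm acceleration factors (2, 4, 10) are assumptions chosen for the example, not measurements from the paper, and the bound holds only when the fraction and the speedup refer to the same workflow and nothing else changes (for example, no extra CPU cores are freed for other work).

```python
import math


def amdahl_speedup(fraction: float, accel: float) -> float:
    """Whole-workflow speedup when `fraction` of the runtime runs `accel` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if accel <= 0.0:
        raise ValueError("accel must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / accel)


def amdahl_ceiling(fraction: float) -> float:
    """Upper bound on the speedup as the accelerated part becomes infinitely fast."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return math.inf if fraction == 1.0 else 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    inference_share = 0.10  # taken from the summary; treat as approximate
    for accel in (2, 4, 10):  # hypothetical per-algorithm speedups
        print(f"accel={accel} -> {amdahl_speedup(inference_share, accel):.3f}")
    # accel=2 -> 1.053
    # accel=4 -> 1.081
    # accel=10 -> 1.099
    print(f"ceiling -> {amdahl_ceiling(inference_share):.3f}")
    # ceiling -> 1.111
```

Under these assumptions, the whole-workflow gain from accelerating inference alone is limited by the share of time that inference occupies, which is why the fraction being offloaded matters as much as the speed of the coprocessor.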

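Proposed Method

Deploy the SONIC framework in a client-server model in which CPU clients send inference requests over the network to dedicated coprocessor servers (such as GPUs).
Implement the SONIC stack within the CMSSW software framework, using gRPC for low-latency communication and NVIDIA Triton Inference Server for model serving.
In the Mini-AOD production workflow, offload specific machine learning inference tasks (such as ParticleNet and other jet-tagging models) from CPUs to remote or local GPUs.
Use ONNX models and TensorRT for model optimization to ensure high inference throughput and low latency on GPU accelerators.
Run experiments in several environments: Google Cloud, the Purdue Tier-2 computing center, and hybrid deployments, to validate scalability and performance.
Measure end-to-end workflow throughput and latency, comparing CPU-only execution with GPU-accelerated inference under different loads and network conditions.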
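
The client-server pattern described above can be sketched with a toy, in-process stand-in. This is not the SONIC or Triton implementation: `ToyInferenceServer`, `toy_model`, `process_event`, and `run_demo` are hypothetical names, threads replace the network, and the "model" simply sums its inputs. The sketch only illustrates the idea that several CPU-side clients submit requests to one shared server, which groups them into batches before running the model.

```python
import queue
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def toy_model(batch):
    """Placeholder for a real ML model: one score per input feature list."""
    return [sum(features) for features in batch]


class ToyInferenceServer:
    """In-process stand-in for a coprocessor server that batches requests."""

    def __init__(self, model, max_batch=8, max_wait_s=0.01):
        self._model = model
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._requests = queue.Queue()
        self.batch_sizes = []  # recorded for inspection
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def infer_async(self, features):
        """Queue one request and return a Future for its result."""
        fut = Future()
        self._requests.put((features, fut))
        return fut

    def shutdown(self):
        self._requests.put(None)  # sentinel: stop the serving loop
        self._thread.join()

    def _serve(self):
        while True:
            item = self._requests.get()
            if item is None:
                return
            batch, stop = [item], False
            # Wait up to max_wait_s for each additional request in the batch.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._requests.get(timeout=self._max_wait_s)
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            self.batch_sizes.append(len(batch))
            outputs = self._model([features for features, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)
            if stop:
                return


def process_event(server, event):
    """Client side: send the inference request, do CPU work, then collect."""
    fut = server.infer_async(event["features"])
    n_features = len(event["features"])  # stands in for other CPU-side work
    return {"id": event["id"], "n": n_features, "score": fut.result(timeout=5)}


def run_demo(n_events=32, n_clients=4):
    server = ToyInferenceServer(toy_model)
    events = [{"id": i, "features": [float(i), 1.0]} for i in range(n_events)]
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        results = list(pool.map(lambda e: process_event(server, e), events))
    server.shutdown()
    return results, server.batch_sizes


if __name__ == "__main__":
    results, batch_sizes = run_demo()
    print(len(results), "events processed in", len(batch_sizes), "batches")
```

Because the clients only hold a `Future`, the server is free to decide how requests are grouped and where they run, which is the property that lets the same client code target a local or a remote coprocessor in an as-a-service design.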

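Experimental Results

Research Questions

RQ1: Can the SONIC as-a-service model effectively accelerate machine learning inference workloads in large-scale high-energy physics data processing pipelines such as CMS Mini-AOD?
RQ2: How much do throughput and latency improve when machine learning inference is offloaded to remote or local GPUs through SONIC?
RQ3: How significant is the communication overhead introduced by network-based inference compared with the performance gains from GPU acceleration?
RQ4: To what extent can the SONIC framework achieve portability across, and efficient utilization of, multiple coprocessor types (such as GPUs and FPGAs) in heterogeneous computing environments?
RQ5: Can the SONIC-based approach scale to production-level workloads while maintaining high GPU utilization and low resource contention?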

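Key Findings

By offloading machine learning inference to GPUs, the SONIC framework achieves up to a 3.5x throughput improvement in the end-to-end Mini-AOD workflow with very low communication overhead.
Individual machine learning models (such as ParticleNet) run up to 4.2x faster on GPUs than with CPU-only execution, with per-event inference latency dropping from about 12 ms to about 3 ms.
The framework keeps network-induced latency very low: the client-server round-trip time averages below 2 ms, confirming that communication overhead does not significantly affect performance.
Under optimized configurations, GPU utilization reaches up to 90%, demonstrating effective load balancing and dynamic scaling across many inference requests.
The SONIC approach allows machine learning workloads to migrate between coprocessor types (for example, from GPUs to FPGAs) with minimal code changes, showing strong portability.
Hybrid deployments, combining a local Tier-2 center with cloud-based GPU resources, deliver stable performance gains, validating the framework's adaptability to distributed computing infrastructure.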

This review was generated by AI and checked by human editors.