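[Paper Summary] Where to Split? A Pareto-Front Analysis of DNN Partitioning for Edge Inference
The paper presents ParetoPipe, an open-source framework that treats DNN partitioning for edge inference as a multi-objective problem and maps the latency-throughput Pareto front across heterogeneous edge hardware and network conditions.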
The deployment of deep neural networks (DNNs) on resource-constrained edge devices is frequently hindered by their significant computational and memory requirements. While partitioning and distributing a DNN across multiple devices is a well-established strategy to mitigate this challenge, prior research has largely focused on single-objective optimization, such as minimizing latency or maximizing throughput. This paper challenges that view by reframing DNN partitioning as a multi-objective optimization problem. We argue that in real-world scenarios, a complex trade-off between latency and throughput exists, which is further complicated by network variability. To address this, we introduce ParetoPipe, an open-source framework that leverages Pareto front analysis to systematically identify optimal partitioning strategies that balance these competing objectives. Our contributions are threefold: we benchmark pipeline partitioned inference on a heterogeneous testbed of Raspberry Pis and a GPU-equipped edge server; we identify Pareto-optimal points to analyze the latency-throughput trade-off under varying network conditions; and we release a flexible, open-source framework to facilitate distributed inference and benchmarking. This toolchain features dual communication backends, PyTorch RPC and a custom lightweight implementation, to minimize overhead and support broad experimentation.
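Motivation and Objectives
- Reframe DNN partitioning for edge inference as a multi-objective optimization problem that balances latency against throughput.
- Benchmark pipeline-partitioned inference on heterogeneous edge hardware to map the Pareto-optimal front.
- Evaluate how robust partitioning strategies are under different network latency and bandwidth conditions.
- Provide an open-source framework that simplifies distributed inference benchmarking and analysis.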
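To make the multi-objective framing concrete, the following is a minimal sketch of Pareto-front filtering over measured configurations. The `SplitResult` record and the numbers in `example` are hypothetical and chosen only for illustration; they are not taken from the paper or from ParetoPipe's code.

```python
from typing import List, NamedTuple


class SplitResult(NamedTuple):
    """One measured configuration (hypothetical record layout)."""
    split: int          # index of the split point
    latency_s: float    # end-to-end latency, lower is better
    throughput: float   # inferences per second, higher is better


def dominates(a: SplitResult, b: SplitResult) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.latency_s <= b.latency_s and a.throughput >= b.throughput
    strictly_better = a.latency_s < b.latency_s or a.throughput > b.throughput
    return no_worse and strictly_better


def pareto_front(results: List[SplitResult]) -> List[SplitResult]:
    """Keep only the configurations that no other configuration dominates."""
    front = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(front, key=lambda r: r.latency_s)


# Made-up numbers for illustration only; they are not taken from the paper.
example = [
    SplitResult(split=1, latency_s=1.0, throughput=1.0),
    SplitResult(split=2, latency_s=1.2, throughput=1.5),
    SplitResult(split=3, latency_s=1.5, throughput=1.4),  # dominated by split 2
    SplitResult(split=4, latency_s=2.0, throughput=2.0),
]
front_splits = [r.split for r in pareto_front(example)]
```

A single-objective optimizer would return only split 1 (lowest latency) or split 4 (highest throughput); the front also keeps split 2, which trades a little latency for noticeably more throughput.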
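Proposed Method
- Present ParetoPipe, an extensible framework that uses pipeline parallelism to partition DNNs across edge devices.
- Implement two communication backends, PyTorch RPC and a custom lightweight TCP socket backend, to study communication overhead.
- Profile block-level execution time for six CNN models to identify optimal split points.
- Exhaustively test every partition point in Pi-to-Pi and Pi-to-GPU setups to generate latency-throughput Pareto fronts.
- Emulate degraded network conditions with tc to study how the front shifts under latency and bandwidth constraints.
- Compare the custom backend against PyTorch RPC to quantify overhead and performance differences.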
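The profiling and exhaustive-search steps above can be sketched with a simple analytical model. This is an illustration under stated assumptions, not the paper's procedure: ParetoPipe measures each split on real hardware, whereas the hypothetical `evaluate_splits` below estimates latency and throughput from per-block profiles. It assumes the transfer behaves like a pipeline stage that is busy for the whole transfer time, and the profile numbers are invented.

```python
from typing import List, Tuple


def evaluate_splits(
    times_a: List[float],   # per-block execution time on the first device (s)
    times_b: List[float],   # per-block execution time on the second device (s)
    out_bytes: List[int],   # size of each block's output tensor (bytes)
    bandwidth_bps: float,   # link bandwidth in bits per second
    delay_s: float,         # one-way link delay (s)
) -> List[Tuple[int, float, float]]:
    """Return (split, latency_s, throughput_per_s) for every interior split.

    Split k runs blocks [0, k) on the first device and blocks [k, n) on the second.
    """
    n = len(times_a)
    results = []
    for k in range(1, n):
        stage_a = sum(times_a[:k])
        stage_b = sum(times_b[k:])
        # Time to ship the activation produced by the last block on device A.
        transfer = delay_s + out_bytes[k - 1] * 8 / bandwidth_bps
        latency = stage_a + transfer + stage_b
        # Assumption: in steady state the slowest stage bounds the rate.
        throughput = 1.0 / max(stage_a, transfer, stage_b)
        results.append((k, latency, throughput))
    return results


# Invented profile: a slow edge device feeding a device that is 10x faster.
times_a = [0.4, 0.3, 0.2, 0.1]
times_b = [0.04, 0.03, 0.02, 0.01]
out_bytes = [800_000, 400_000, 100_000, 4_000]

fast_link = evaluate_splits(times_a, times_b, out_bytes, 100e6, 0.001)
slow_link = evaluate_splits(times_a, times_b, out_bytes, 1e6, 0.001)
best_fast = min(fast_link, key=lambda r: r[1])[0]
best_slow = min(slow_link, key=lambda r: r[1])[0]
```

With this invented profile the lowest-latency split moves from the earliest cut on the fast link to the latest cut on the slow link, because the large early activations become expensive to transmit.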
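Experimental Results
Research Questions
- RQ1: How can DNN partitioning for edge inference be analyzed as a multi-objective optimization problem that balances latency against throughput?
- RQ2: What are the Pareto-optimal partition points of common CNN models on heterogeneous edge hardware?
- RQ3: How do network latency and bandwidth limits change the latency-throughput front and affect partitioning decisions?
- RQ4: How large is the performance difference between the custom socket-based backend and PyTorch RPC in distributed inference?
- RQ5: How does block-level profiling shape the optimal partitioning strategy across models and configurations?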
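RQ4 concerns a custom socket-based backend. The summary does not describe its wire format, so the following is only a sketch of one common way to move a serialized activation over a stream socket: length-prefixed framing. All names here (`HEADER`, `send_msg`, `recv_msg`, `loopback_demo`) are hypothetical, and `socket.socketpair()` stands in for a real TCP connection between two devices.

```python
import socket
import struct

HEADER = struct.Struct("!Q")  # 8-byte big-endian payload length


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def loopback_demo(payload: bytes) -> bytes:
    """Round-trip a small payload through a connected socket pair in one process."""
    left, right = socket.socketpair()
    try:
        send_msg(left, payload)
        return recv_msg(right)
    finally:
        left.close()
        right.close()
```

Framing of this kind adds a fixed 8-byte header per message, which is one reason a hand-written socket path can carry less overhead than a general-purpose RPC layer.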
Key Findings
| Model (Split) | Pi1 Exec (s) | Pi2 Exec (s) | Net Time (s) | Pi1 CPU Util (%) | Pi2 CPU Util (%) | Pi1 Mem (%) | Pi2 Mem (%) |
|---|---|---|---|---|---|---|---|
| AlexNet (P10) | 0.451 | 0.427 | 0.050 | 272.7 | 271.9 | 24.84 | 21.76 |
| AlexNet (P11) | 0.419 | 0.379 | 0.045 | 347.7 | 271.9 | 21.72 | 20.36 |
| InceptionV3 (P10) | 2.873 | 2.766 | 0.055 | 328.8 | 322.7 | 22.24 | 20.76 |
| InceptionV3 (P19) | 5.791 | 0.002 | 0.040 | 345.9 | 0.19 | 19.48 | 20.20 |
| MobileNetV2 (P3) | 0.969 | 0.941 | 0.048 | 281.2 | 309.6 | 18.51 | 16.16 |
| MobileNetV2 (P17) | 1.818 | 0.065 | 0.049 | 311.9 | 12.1 | 16.84 | 15.39 |
| ResNet18 (P2) | 0.699 | 0.846 | 0.043 | 296.7 | 329.8 | 20.21 | 16.73 |
| ResNet18 (P6) | 1.290 | 0.278 | 0.046 | 351.7 | 76.2 | 17.33 | 16.58 |
| ResNet50 (P5) | 2.690 | 2.650 | 0.052 | 342.2 | 336.0 | 23.65 | 19.29 |
| ResNet50 (P15) | 5.233 | 0.196 | 0.041 | 360.4 | 14.0 | 19.48 | 18.74 |
| VGG16 (P14) | 6.827 | 6.319 | 0.056 | 328.9 | 305.7 | 35.43 | 32.87 |
| VGG16 (P29) | 13.37 | 0.894 | 0.044 | 347.6 | 19.3 | 33.95 | 34.76 |
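One way to read the table is to derive rough latency and throughput estimates from the three timing columns. The sketch below assumes the two execution times and the network time are per-inference stage times of a two-stage pipeline; that is an assumption, and the derived values are estimates rather than measurements reported by the paper. The `summarize` helper is hypothetical.

```python
# (model, split) -> (Pi1 exec s, Pi2 exec s, network time s), copied from the table.
ROWS = {
    ("InceptionV3", "P10"): (2.873, 2.766, 0.055),
    ("InceptionV3", "P19"): (5.791, 0.002, 0.040),
    ("ResNet50", "P5"): (2.690, 2.650, 0.052),
    ("ResNet50", "P15"): (5.233, 0.196, 0.041),
}


def summarize(pi1_s: float, pi2_s: float, net_s: float) -> dict:
    """Estimate latency and pipelined throughput under a simple stage model."""
    latency = pi1_s + net_s + pi2_s          # one sample traverses every stage
    bottleneck = max(pi1_s, pi2_s, net_s)    # slowest stage limits the pipeline
    return {
        "latency_s": latency,
        "throughput_per_s": 1.0 / bottleneck,
        "balance": min(pi1_s, pi2_s) / max(pi1_s, pi2_s),  # 1.0 = perfectly even
    }


SUMMARY = {key: summarize(*row) for key, row in ROWS.items()}
```

Under this model the balanced InceptionV3 P10 split and the lopsided P19 split have similar estimated latency (about 5.7 s versus 5.8 s), but the P10 bottleneck stage is roughly half as long, so its estimated steady-state throughput is about twice as high.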
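- The Pareto fronts show markedly different optimal split points for Pi-to-Pi and Pi-to-GPU deployments: MobileNetV2 and similar models favor asymmetric splits on Pi-to-Pi and more offloading when a GPU is involved.
- Under realistic network constraints, the front shifts toward more computation on the edge device, and the benefit of GPU offloading shrinks when data transfer overhead is high.
- Compared with PyTorch RPC, the custom socket backend substantially reduces end-to-end latency (by up to 76% in the MobileNetV2 throughput example) and increases throughput (by up to 53%).
- Block-level profiling shows that blocks are not equally costly, which steers split points toward a balance between computation and cross-device communication.
- Network conditions are a first-class bottleneck: high latency and low bandwidth increase data transfer overhead and erode the benefit of GPU acceleration.
- Under network bottlenecks the Pareto front becomes sparse, which highlights the need for network-aware adaptive partitioning.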
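The degraded network conditions behind these findings were emulated with tc. The summary does not give the exact invocations, so the sketch below assumes the netem queueing discipline and only builds the command lines without running them. The interface name `eth0` and the values 200 ms and 5 Mbit/s are illustrative assumptions, not settings from the paper, and `netem_commands` is a hypothetical helper.

```python
from typing import List, Optional


def netem_commands(
    dev: str,
    delay_ms: Optional[int] = None,
    rate_mbit: Optional[int] = None,
) -> List[List[str]]:
    """Build (but do not run) tc commands that shape egress traffic on `dev`.

    Returns [shape_command, reset_command] as argument lists.
    """
    shape = ["tc", "qdisc", "replace", "dev", dev, "root", "netem"]
    if delay_ms is not None:
        shape += ["delay", f"{delay_ms}ms"]
    if rate_mbit is not None:
        shape += ["rate", f"{rate_mbit}mbit"]
    reset = ["tc", "qdisc", "del", "dev", dev, "root"]
    return [shape, reset]


# Illustrative values only: 200 ms added delay and a 5 Mbit/s cap on eth0.
shape_cmd, reset_cmd = netem_commands("eth0", delay_ms=200, rate_mbit=5)
shape_line = " ".join(shape_cmd)
```

Applying the commands requires root privileges on a Linux host, for example by passing each argument list to `subprocess.run(cmd, check=True)`; running the reset command afterwards restores the default queueing discipline.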
This summary was generated by AI and reviewed by a human editor.