QUICK REVIEW

[Paper Review] Across-Stack Profiling and Characterization of Machine Learning Models on GPUs.

Cheng Li, Abdul Dakkak|arXiv (Cornell University)|Aug 19, 2019

Parallel Computing and Optimization Techniques | References: 15 | Cited by: 6

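One-Sentence Summary

This paper proposes XSP, an across-stack profiling framework that combines distributed tracing with a leveled, iterative measurement approach to capture accurate latencies despite profiling overhead, giving a holistic, hierarchical view of ML model execution across the full HW/SW stack. The authors use it to characterize the latency of 65 state-of-the-art models and surface insights that are hard to see from any single stack level because of cross-level dependencies.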

ABSTRACT

There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results. This paper proposes XSP - an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise.

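Motivation and Objectives

Address the incomplete, fragmented state of profiling across the ML HW/SW stack, which limits optimization for latency, throughput, cost, and energy.
Provide a unified, hierarchical view of ML model execution by correlating profile data from multiple stack levels (model, framework, library, hardware).
Keep latency measurements accurate at every stack level through an iterative measurement approach, despite the overhead that profiling adds.
Enable systematic analysis of diverse ML models by coupling profiling with an automated analysis pipeline.
Expose stack-level bottlenecks and dependencies that isolated, single-level profiling tools struggle to detect.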

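Proposed Method

XSP uses distributed tracing to aggregate and correlate profile data from every level of the ML stack: the model, the deep learning framework, system libraries, and GPU hardware.
A leveled, iterative measurement strategy captures accurate latencies at each stack level in spite of profiling overhead.
The framework integrates with existing profilers at each stack level and synchronizes their data through trace identifiers, which enables cross-level correlation.
An automated analysis pipeline processes the aggregated data and extracts performance characteristics for 65 state-of-the-art ML models.
The design supports profiling of live model execution with little runtime interference, so the characterization reflects real execution behavior.
XSP enables hierarchical analysis by decomposing execution into stack-level components and measuring each component's contribution to end-to-end latency.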
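The cross-level correlation described above can be illustrated with a minimal Python sketch. It is not XSP's implementation: the `Span` class, the level numbering (0 = model, 1 = framework layer, 2 = GPU kernel), the span names, and the timestamps are all hypothetical. It also assumes that spans from a deeper level nest fully inside the span that triggered them, which in practice would require synchronous execution. Each span is attached to the tightest enclosing span one level up, and the resulting tree gives a hierarchical latency breakdown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """A timed interval reported by a profiler at one stack level (hypothetical type)."""
    name: str
    level: int        # assumed numbering: 0 = model, 1 = layer, 2 = GPU kernel
    start_us: float
    end_us: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_us(self) -> float:
        return self.end_us - self.start_us


def correlate(spans: List[Span]) -> List[Span]:
    """Attach each span to the tightest enclosing span one level up; return the roots."""
    roots = []
    for span in spans:
        enclosing = [
            p for p in spans
            if p.level == span.level - 1
            and p.start_us <= span.start_us
            and span.end_us <= p.end_us
        ]
        if enclosing:
            min(enclosing, key=lambda p: p.duration_us).children.append(span)
        else:
            roots.append(span)
    return roots


def breakdown(span: Span, indent: int = 0) -> List[str]:
    """Render a span tree as indented lines, children ordered by start time."""
    lines = [f"{'  ' * indent}{span.name}: {span.duration_us:.1f} us"]
    for child in sorted(span.children, key=lambda c: c.start_us):
        lines.extend(breakdown(child, indent + 1))
    return lines


# Invented example data standing in for the output of three separate profilers.
spans = [
    Span("predict", 0, 0.0, 100.0),
    Span("conv1", 1, 5.0, 60.0),
    Span("relu1", 1, 62.0, 90.0),
    Span("gemm_kernel", 2, 10.0, 50.0),
    Span("relu_kernel", 2, 65.0, 85.0),
]
roots = correlate(spans)
print("\n".join(breakdown(roots[0])))
```

The quadratic search keeps the sketch short; with many spans, an interval tree or a sweep over sorted timestamps would be the more practical choice.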

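Experimental Results

Research Questions

RQ1: How can profiling be unified across the entire ML stack to give a holistic view of model execution?
RQ2: How does profiling overhead affect measurement accuracy, and how can its effect be minimized without sacrificing measurement fidelity?
RQ3: Which performance bottlenecks and dependencies emerge when ML models are profiled across hardware and software levels?
RQ4: How does the latency distribution across stack levels differ between ML models, and what patterns can be identified?
RQ5: Which system optimization insights does across-stack profiling reveal that isolated profiling tools cannot capture?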

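Key Findings

XSP measures latency accurately at every level of the HW/SW stack while keeping the effect of profiling overhead on the results very low.
The framework reveals previously hidden bottlenecks that span multiple stack levels, such as kernel launch overhead and inefficient memory transfers.
Across-stack correlation exposes dependencies between framework-level operations and GPU kernel execution that are invisible to isolated profiling.
Across 65 state-of-the-art models, the automated analysis pipeline identifies consistent performance patterns, including shifts in the compute-to-memory ratio and characteristic operator-level latency distributions.
XSP identifies optimization opportunities that conventional profiling tools struggle to find, such as kernel fusion and memory access tuning.
The iterative measurement approach stays accurate under real workloads, which supports the framework's reliability for system design and tuning.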
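One way to picture the leveled, iterative measurement idea is sketched below in Python. It rests on an assumption this review does not spell out: the same workload is run repeatedly with profiling enabled down to progressively deeper levels (labeled here M, M/L, M/L/G for model, layer, and GPU kernel), and the growth of the model-level span between consecutive configurations is attributed to the profiler added at the deeper level. The configuration labels, the `overhead_per_level` helper, and all latencies are hypothetical and invented for illustration; none of the numbers come from the paper.

```python
from statistics import median
from typing import Dict, List, Sequence

# Invented latencies (ms) of the model-level span over repeated runs, keyed by
# which stack levels were being profiled during the run. Not data from the paper.
model_span_ms: Dict[str, List[float]] = {
    "M": [10.1, 10.0, 9.9],        # model-level timing only
    "M/L": [11.0, 11.2, 11.1],     # model + framework layers
    "M/L/G": [14.9, 15.0, 15.3],   # model + layers + GPU kernels
}


def overhead_per_level(
    runs: Dict[str, List[float]],
    order: Sequence[str] = ("M", "M/L", "M/L/G"),
) -> Dict[str, float]:
    """Estimate the overhead added by each deeper profiling level.

    The median over repeated runs damps outliers; the overhead of a deeper
    configuration is the increase of the model-level span relative to the
    next shallower configuration.
    """
    typical = {config: median(runs[config]) for config in order}
    return {
        deeper: typical[deeper] - typical[shallower]
        for shallower, deeper in zip(order, order[1:])
    }


overhead = overhead_per_level(model_span_ms)
for config, extra_ms in overhead.items():
    print(f"{config}: about {extra_ms:.1f} ms of added profiling overhead")
```

Under this assumption, the latency reported for each level would be taken from the run in which that level is the deepest one profiled, so it is not inflated by the profilers of the levels beneath it.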


This review was generated by AI and reviewed by a human editor.