QUICK REVIEW

[Paper Review] Performance Modeling and Prediction for Dense Linear Algebra

Elmar Peise|arXiv (Cornell University)|Jan 1, 2017

Parallel Computing and Optimization Techniques | References: 79 | Cited by: 2

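One-Sentence Summary

This dissertation presents measurement-based performance modeling and prediction techniques for dense linear algebra that estimate an algorithm's runtime from low-overhead models of its compute kernels rather than executing it, which makes it possible to quickly and accurately select the fastest configuration (such as the block size or the tensor traversal order) on a range of hardware platforms.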

ABSTRACT

This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions allow to quickly select the fastest algorithm configurations from available alternatives. We consider two scenarios that cover a wide range of computations: To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm, and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.
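The core principle in the abstract, predicting a blocked algorithm's runtime by summing estimates for its kernel calls instead of running it, can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: the function names and the `TOY_MODELS` table are hypothetical, and the toy models simply divide approximate flop counts by made-up kernel rates, where the real models are generated from measurements on the target platform.

```python
def cholesky_kernel_calls(n, b):
    """Yield (kernel, sizes) for a right-looking blocked Cholesky of an n x n matrix."""
    for j in range(0, n, b):
        jb = min(b, n - j)      # size of the current diagonal block
        rest = n - j - jb       # rows below the diagonal block
        yield ("potf2", (jb,))  # unblocked factorization of the diagonal block
        if rest > 0:
            yield ("trsm", (rest, jb))  # triangular solve for the panel
            yield ("syrk", (rest, jb))  # symmetric update of the trailing matrix


# ASSUMPTION: stand-in models (per-call overhead + flops / invented rate).
# Real models would come from kernel measurements, not from these formulas.
TOY_MODELS = {
    "potf2": lambda jb: 1e-6 + (jb ** 3 / 3) / 2e9,
    "trsm": lambda m, jb: 1e-6 + (m * jb ** 2) / 8e9,
    "syrk": lambda m, jb: 1e-6 + (m ** 2 * jb) / 10e9,
}


def predict_runtime(n, b, models=TOY_MODELS):
    """Predicted runtime in seconds: the sum of the kernel estimates."""
    return sum(models[kernel](*sizes) for kernel, sizes in cholesky_kernel_calls(n, b))


def best_block_size(n, candidates, models=TOY_MODELS):
    """Pick the candidate block size with the smallest predicted runtime."""
    return min(candidates, key=lambda b: predict_runtime(n, b, models))
```

Calling `best_block_size(2000, [16, 32, 64, 128, 256])` evaluates every candidate without a single factorization, which is why the selection is effectively instantaneous once the kernel models exist.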

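Motivation and Objectives

Develop a framework that predicts the performance of dense linear algebra algorithms without executing them.
Identify the fastest variant and a near-optimal block size for blocked algorithms built on BLAS kernels, across hardware platforms.
Reduce the cost of performance tuning by replacing full executions with lightweight micro-benchmarks.
Enable automatic, platform-specific performance modeling for high-performance linear algebra workloads.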

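Proposed Method

Build algorithm-independent performance models for BLAS kernels from runtime measurements on the target platform.
Analyze how kernel arguments (such as leading dimensions, increments, and operand sizes) affect runtime, in order to capture effects such as cache-line alignment and set-associativity conflicts.
Use a Python-based framework (ELAPS) to collect and analyze performance data across many configurations.
Design cache-aware micro-benchmarks that exploit the regular structure of tensor contraction algorithms.
Generate predictive models by fitting piecewise polynomials to statistical summaries of repeated measurements.
Use these models to instantly predict the fastest algorithm configuration for a given problem size and hardware platform.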
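The model-generation step above (repeated measurements, a statistical summary, a piecewise polynomial fit) can be sketched as follows. This is a simplified, one-dimensional illustration under stated assumptions: the median is chosen here as the summary statistic, the breakpoints are supplied by hand, and all function names are hypothetical. Kernel models in practice depend on several size arguments, which this sketch does not attempt to cover.

```python
from statistics import median


def solve_linear(a, y):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (tiny problems only)."""
    powers = range(degree + 1)
    a = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    y = [sum(v * x ** i for x, v in zip(xs, ys)) for i in powers]
    return solve_linear(a, y)  # coefficients, lowest power first


def build_model(samples, breakpoints, degree=2):
    """samples: {size: [repeated runtimes]}; breakpoints: sorted piece boundaries.

    Each piece needs at least degree + 1 distinct sizes.
    """
    points = sorted((n, median(ts)) for n, ts in samples.items())
    edges = [float("-inf")] + list(breakpoints) + [float("inf")]
    pieces = []
    for lo, hi in zip(edges, edges[1:]):
        xs = [n for n, _ in points if lo <= n < hi]
        ys = [t for n, t in points if lo <= n < hi]
        pieces.append((lo, hi, fit_poly(xs, ys, degree)))
    return pieces


def estimate(model, n):
    """Evaluate the polynomial of the piece that contains size n."""
    for lo, hi, coeffs in model:
        if lo <= n < hi:
            return sum(c * n ** i for i, c in enumerate(coeffs))
    raise ValueError("size outside model domain")
```

Summarizing with the median keeps a single slow outlier (for example, a run disturbed by another process) from distorting the fit, and separate pieces let the model follow changes in behavior, such as an operand no longer fitting in cache.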

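Experimental Results

Research Questions

RQ1: How can the performance of dense linear algebra algorithms be predicted without executing the full algorithms?
RQ2: On modern architectures, which kernel-level factors have the greatest effect on the runtime of BLAS-level operations?
RQ3: Can lightweight micro-benchmarks accurately predict the best configuration for a tensor contraction?
RQ4: How can performance models be generated automatically and reused across hardware platforms?
RQ5: What is the minimum measurement overhead needed for highly accurate performance predictions?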

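Key Findings

The performance models identify the fastest algorithm configuration with high accuracy; on average, predictions deviate from measured performance by less than 5%.
The micro-benchmarks for tensor contractions need only a fraction of the full contraction's runtime (for example, under 1%) to identify the best combination of traversal order and compute kernel.
The framework identifies near-optimal block sizes for a variety of matrix operations and hardware platforms.
The performance models account for key low-level effects such as cache-line alignment, set-associativity conflicts, and Turbo Boost fluctuations.
The ELAPS framework provides automated, reproducible performance measurement and model generation with very little user intervention.
By removing the need to execute full algorithms while searching the configuration space, the approach substantially reduces the cost of performance tuning.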
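The micro-benchmark idea for tensor contractions (time a few invocations of the inner kernel, then extrapolate to the whole loop nest) can be illustrated with a short sketch. It rests on loud assumptions: `microbenchmark`, `pick_fastest`, and the pure-Python stand-in kernels are hypothetical, and discarding the first call as a warm-up is only a crude substitute for the cache-aware setup described in the dissertation, which recreates the cache state a kernel sees in the middle of a contraction.

```python
import time
from statistics import median


def microbenchmark(kernel, n_calls_total, n_samples=5):
    """Time a few kernel calls and extrapolate to n_calls_total calls."""
    timings = []
    for _ in range(n_samples + 1):
        start = time.perf_counter()
        kernel()
        timings.append(time.perf_counter() - start)
    per_call = median(timings[1:])  # drop the first (cold) call
    return per_call * n_calls_total


def pick_fastest(candidates):
    """candidates: {name: (kernel, n_calls_total)} -> (best name, predictions)."""
    predictions = {
        name: microbenchmark(kernel, calls)
        for name, (kernel, calls) in candidates.items()
    }
    return min(predictions, key=predictions.get), predictions


# Stand-in kernels in pure Python; a real setup would call BLAS gemv/gemm.
def make_gemv(m, k):
    a = [[1.0] * k for _ in range(m)]
    x = [1.0] * k
    return lambda: [sum(r * v for r, v in zip(row, x)) for row in a]


def make_gemm(m, k, n):
    a = [[1.0] * k for _ in range(m)]
    b_cols = [[1.0] * k for _ in range(n)]  # columns of B
    return lambda: [[sum(r * v for r, v in zip(row, col)) for col in b_cols] for row in a]


# Contraction C[a,b,c] = sum_i A[a,i] * B[i,b,c] with a = i = 16, b = c = 8:
# either loop over c and call gemm 8 times, or loop over (b, c) and call gemv 64 times.
CANDIDATES = {
    "loop c, gemm": (make_gemm(16, 16, 8), 8),
    "loop b and c, gemv": (make_gemv(16, 16), 64),
}
```

`pick_fastest(CANDIDATES)` runs six kernel calls per candidate regardless of how many calls the full contraction would make, so the cost of the prediction stays small relative to the contraction itself as the looped dimensions grow.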

This review was generated by AI and reviewed by a human editor.