QUICK REVIEW

[Paper Review] ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

Cody Rivera, Jieyang Chen|arXiv (Cornell University)|Feb 9, 2020

Parallel Computing and Optimization Techniques|References: 22|Cited by: 5

One-Sentence Summary

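This paper proposes two GPU algorithms, TSM2R and TSM2L, for irregular-shaped tall-and-skinny matrix-matrix multiplication. By reorganizing data access and thread mapping for these shapes, they achieve speedups of up to 3x (TSM2R) and 3.5x (TSM2L) over state-of-the-art approaches, improve memory bandwidth utilization by up to 55%, and also raise computing power utilization.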

ABSTRACT

Linear algebra operations have been widely used in big data analytics and scientific computations. Many works have been done on optimizing linear algebra operations on GPUs with regular-shaped input. However, few works focus on fully utilizing GPU resources when the input is not regular-shaped. Current optimizations do not consider fully utilizing the memory bandwidth and computing power; therefore, they can only achieve sub-optimal performance. In this paper, we propose two efficient algorithms -- TSM2R and TSM2L -- for two classes of tall-and-skinny matrix-matrix multiplications on GPUs. Both of them focus on optimizing linear algebra operations in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for a large regular-shaped matrix multiplying a tall-and-skinny matrix, while TSM2L is designed for a tall-and-skinny matrix multiplying a small regular-shaped matrix. We implement our proposed algorithms and test on several modern NVIDIA GPU micro-architectures. Experiments show that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.1x~3x and improves the memory bandwidth utilization and computing power utilization by 8%~47.6% and 7%~37.3%, respectively, when the regular-shaped matrix size is relatively large or medium; and (2) TSM2L speeds up the computation by 1.1x~3.5x and improves the memory bandwidth utilization by up to 55% when the regular-shaped matrix size is relatively small.

Research Motivation and Objectives

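Address the under-utilization of GPU memory bandwidth and computing power in irregular-shaped matrix-matrix multiplication, particularly when an input matrix is tall-and-skinny.
Optimize performance for two common cases: a large regular-shaped matrix multiplying a tall-and-skinny matrix (TSM2R), and a tall-and-skinny matrix multiplying a small regular-shaped matrix (TSM2L).
Raise resource utilization beyond existing state-of-the-art methods, which do not fully exploit the GPU on irregular-shaped input.
Achieve higher performance and efficiency on modern NVIDIA GPU micro-architectures through tailored memory access and thread mapping strategies.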
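
To make the two target cases concrete, the following minimal Python sketch classifies a product C = A × B by operand shape and estimates its arithmetic intensity (floating-point operations per byte moved). The function names, the `skinny_ratio` threshold, and the byte-count model are illustrative assumptions, not definitions taken from the paper; the intensity estimate is the usual idealized one in which each matrix is read or written exactly once.

```python
def is_tall_and_skinny(rows, cols, skinny_ratio=16):
    """Hypothetical test: rows exceed cols by at least skinny_ratio."""
    return rows >= skinny_ratio * cols


def classify(a_shape, b_shape, skinny_ratio=16):
    """Label C = A x B as 'TSM2R', 'TSM2L', or 'other' (assumed heuristic)."""
    (m, k), (k2, n) = a_shape, b_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    a_skinny = is_tall_and_skinny(m, k, skinny_ratio)
    b_skinny = is_tall_and_skinny(k, n, skinny_ratio)
    if b_skinny and not a_skinny:
        return "TSM2R"  # regular-shaped A times tall-and-skinny B
    if a_skinny and not b_skinny:
        return "TSM2L"  # tall-and-skinny A times small regular-shaped B
    return "other"


def arithmetic_intensity(a_shape, b_shape, bytes_per_element=8):
    """Idealized flops per byte, assuming A, B, and C each move once."""
    (m, k), (_, n) = a_shape, b_shape
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    print(classify((10000, 10000), (10000, 8)))
    print(classify((1000000, 8), (8, 8)))
    print(arithmetic_intensity((10000, 10000), (10000, 8)))
    print(arithmetic_intensity((1024, 1024), (1024, 1024)))
```

Under this model, the tall-and-skinny product performs far fewer operations per byte than a square product of comparable size, which is consistent with the review's emphasis on memory bandwidth utilization.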

Proposed Methods

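Design TSM2R for a large regular-shaped matrix multiplying a tall-and-skinny matrix, reorganizing memory access patterns to improve coalescing and reduce bank conflicts.
Implement TSM2L for a tall-and-skinny matrix multiplying a small regular-shaped matrix, focusing on reducing redundant memory transactions and maximizing thread occupancy.
Use custom thread block mapping and shared memory tiling strategies tailored to the irregular shape of tall-and-skinny matrices.
Balance the workload across GPU warps by adjusting thread block dimensions dynamically according to matrix size and GPU architecture.
Optimize memory access patterns, using coalesced and strided access to improve memory bandwidth utilization.
Evaluate and tune the algorithms on multiple NVIDIA GPU micro-architectures so that both correctness and performance carry over across them.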
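
The shared memory tiling idea in the list above can be illustrated on the CPU. The sketch below is a plain Python analogue, not the authors' CUDA kernel: it computes C = A × B for a tall-and-skinny B by staging one row tile of B at a time (standing in for a shared-memory tile) and reusing it across every row of A. The function name `tiled_matmul` and the `tile` parameter are hypothetical.

```python
def tiled_matmul(a, b, tile=4):
    """Compute C = A x B with B staged in row tiles (CPU analogue of tiling).

    a: list of m rows, each of length k; b: list of k rows, each of length n.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    c = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile):
        # Stage one tile of B; on a GPU this would live in shared memory.
        b_tile = b[k0:k0 + tile]
        for i in range(m):
            a_row = a[i]
            c_row = c[i]
            # Each element of A is read exactly once over the whole loop nest.
            for dk, b_row in enumerate(b_tile):
                a_val = a_row[k0 + dk]
                for j in range(n):
                    c_row[j] += a_val * b_row[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1, 0], [0, 1], [1, 1]]
    print(tiled_matmul(a, b, tile=2))
```

When B has many more rows than columns, each staged tile stays small while A is streamed through once, which is the general access pattern the tiling items describe. An actual GPU kernel would additionally need to map rows of A to threads and keep global loads coalesced, details this sketch leaves out.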

Experimental Results

Research Questions

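RQ1: How can GPU memory bandwidth and computing power utilization be maximized in irregular-shaped matrix-matrix multiplication, particularly with tall-and-skinny matrices?
RQ2: What are the performance bottlenecks of existing optimized GPU kernels when handling irregular matrix shapes?
RQ3: Can custom memory access and thread mapping strategies significantly improve tall-and-skinny matrix multiplication performance on modern GPUs?
RQ4: How do the proposed TSM2R and TSM2L algorithms compare with state-of-the-art methods in speedup and resource utilization?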

Key Findings

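When the regular-shaped matrix is large or medium-sized, TSM2R runs 1.1x to 3x faster than state-of-the-art methods and improves memory bandwidth utilization by 8% to 47.6%.
On the tested GPU architectures, TSM2R improves computing power utilization by 7% to 37.3%.
When the regular-shaped matrix is small, TSM2L runs 1.1x to 3.5x faster and improves memory bandwidth utilization by up to 55%.
By making more effective use of the GPU memory hierarchy and thread-level parallelism, the proposed algorithms significantly outperform existing methods on irregular matrix shapes.
The gains hold across several modern NVIDIA GPU micro-architectures, indicating that the optimization strategy is robust.
The results confirm that shape-aware optimization is essential for high-performance irregular-shaped matrix operations on GPUs.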


This review was generated by AI and reviewed by a human editor.