QUICK REVIEW

[Paper Review] SMASH: Sparse Matrix Atomic Scratchpad Hashing

Kaustubh Shivdikar|arXiv (Cornell University)|Jan 1, 2021

Parallel Computing and Optimization Techniques | References: 98 | Cited by: 5

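One-Sentence Summary

SMASH introduces a novel SpGEMM kernel for the PIUMA architecture that uses atomic hashing, tokenization, and fragmented memory to achieve a 9.4× speedup over the baseline implementation. By exploiting PIUMA's distributed global address space (DGAS), DMA engines, and multi-threaded cores, SMASH V3 maximizes DRAM bandwidth utilization (95.9%) and reaches close to 100% core utilization through dynamic workload balancing and in-memory hash-table merging of partial products.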

ABSTRACT

Sparse matrices, more specifically SpGEMM kernels, are commonly found in a wide range of applications, spanning graph-based path-finding to machine learning algorithms (e.g., neural networks). A particular challenge in implementing SpGEMM kernels has been the pressure placed on DRAM memory. One approach to tackle this problem is to use an inner product method for the SpGEMM kernel implementation. While the inner product produces fewer intermediate results, it can end up saturating the memory bandwidth, given the high number of redundant fetches of the input matrix elements. Using an outer product-based SpGEMM kernel can reduce redundant fetches, but at the cost of increased overhead due to extra computation and memory accesses for producing/managing partial products. In this thesis, we introduce a novel SpGEMM kernel implementation based on the row-wise product approach. We leverage atomic instructions to merge intermediate partial products as they are generated. The use of atomic instructions eliminates the need to create partial product matrices. To evaluate our row-wise product approach, we map an optimized SpGEMM kernel to a custom accelerator designed to accelerate graph-based applications. The targeted accelerator is an experimental system named PIUMA, being developed by Intel. PIUMA provides several attractive features, including fast context switching, user-configurable caches, globally addressable memory, non-coherent caches, and asynchronous pipelines. We tailor our SpGEMM kernel to exploit many of the features of the PIUMA fabric. This thesis compares our SpGEMM implementation against prior solutions, all mapped to the PIUMA framework. We briefly describe some of the PIUMA architecture features and then delve into the details of our optimized SpGEMM kernel. Our SpGEMM kernel can achieve 9.4x speedup as compared to competing approaches.
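The row-wise product idea in the abstract, where partial products are merged as they are generated instead of being written to an intermediate matrix, can be sketched in a few lines of Python. This is only an illustration: the dict-of-dicts matrix layout and the function name `spgemm_rowwise` are assumptions made for this example, and a per-row Python dictionary stands in for the atomically updated hash table used on PIUMA.

```python
from collections import defaultdict


def spgemm_rowwise(a_rows, b_rows):
    """Row-wise sparse product C = A * B.

    a_rows and b_rows map a row index to a dict {column: value}.
    This layout is an illustrative assumption, not the format used in the thesis.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)  # accumulator for output row i, keyed by column
        for k, a_ik in a_row.items():
            # Row i of A selects row k of B; each product lands in column j of C.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj  # merge the partial product immediately
        if acc:
            c_rows[i] = dict(acc)
    return c_rows


A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm_rowwise(A, B)
print(C)
```

Row 0 of A touches rows 0 and 2 of B, and both contribute to column 1 of C, so the two partial products are summed in place and no partial product matrix is ever stored.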

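Research Motivation and Goals

Address the SpGEMM performance bottlenecks on general-purpose architectures, which stem from irregular memory accesses and workload imbalance.
Optimize the mapping of the SpGEMM kernel onto the domain-specific PIUMA architecture to exploit its distinctive features, such as DGAS, DMA engines, and multi-threaded cores.
Eliminate the redundant DRAM accesses caused by intermediate partial product matrices in row-wise SpGEMM methods.
Achieve high hardware utilization by dynamically balancing the workload across threads and minimizing idle cycles.
Explore the feasibility of merging partial products in an in-memory hash table, to avoid the capacity limits of the on-chip scratchpad (SPAD) and improve data reuse.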

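Proposed Method

SMASH V1 uses atomic hashing to store partial products directly in a global hash table, avoiding intermediate matrix storage and redundant DRAM reads.
SMASH V2 adds tokenization, which assigns work dynamically according to the estimated FLOPs of each row, balancing the workload and improving thread utilization.
SMASH V3 adds fragmented memory and a producer-consumer model to reduce memory access overhead and improve data locality.
PIUMA's DMA engines offload data movement from the compute cores, so the multi-threaded cores (MTCs) can focus on computation and spend fewer instruction cycles on overhead.
The hash table is stored in DRAM rather than in the on-chip SPAD, which relieves on-chip memory pressure and supports larger sparse matrix operations.
High-order-bit and low-order-bit hashing are combined to resolve collisions, and dynamic load balancing mitigates hotspots.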
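The FLOP-based work distribution described for SMASH V2 can be sketched as follows. This is a hedged illustration, not the thesis implementation: the helper names (`estimate_row_flops`, `make_tokens`, `run_workers`), the FLOP budget per token, and the use of Python threads with a shared queue are all assumptions chosen to show the idea of workers pulling similarly sized units of work on demand.

```python
import queue
import threading


def estimate_row_flops(a_cols, b_row_nnz):
    """Estimate multiplies per row of A: sum of nnz(B[k, :]) over A's nonzero columns k."""
    return {i: sum(b_row_nnz.get(k, 0) for k in cols) for i, cols in a_cols.items()}


def make_tokens(row_flops, budget):
    """Group consecutive rows into tokens holding roughly `budget` estimated FLOPs."""
    tokens, current, load = [], [], 0
    for i in sorted(row_flops):
        current.append(i)
        load += row_flops[i]
        if load >= budget:
            tokens.append(current)
            current, load = [], 0
    if current:
        tokens.append(current)
    return tokens


def run_workers(tokens, num_workers, process_row):
    """Each worker repeatedly pulls the next token until the shared queue is empty."""
    work = queue.Queue()
    for token in tokens:
        work.put(token)

    def worker():
        while True:
            try:
                token = work.get_nowait()
            except queue.Empty:
                return
            for row in token:
                process_row(row)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Hypothetical inputs: nonzero column indices per row of A, and nnz per row of B.
a_cols = {0: [0, 2], 1: [1], 2: [0, 1, 2], 3: [2]}
b_row_nnz = {0: 1, 1: 4, 2: 2}

row_flops = estimate_row_flops(a_cols, b_row_nnz)
tokens = make_tokens(row_flops, budget=6)

done = []
run_workers(tokens, num_workers=2, process_row=done.append)
print(row_flops, tokens, sorted(done))
```

Because tokens are sized by estimated work rather than by row count, a heavy row (row 2 here) gets a token to itself while light rows are grouped, and a worker that finishes early simply takes the next token instead of idling.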

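Experimental Results

Research Questions

RQ1: How can the SpGEMM kernel be optimized to bring DRAM bandwidth close to saturation on a domain-specific accelerator?
RQ2: What role does dynamic workload balancing play in improving multi-threaded core utilization for irregular SpGEMM workloads?
RQ3: Can in-memory hash-table merging eliminate the need for intermediate partial product matrices and reduce redundant memory accesses?
RQ4: How do architectural features such as DGAS, DMA engines, and networked instructions affect SpGEMM performance on PIUMA?
RQ5: In the SpGEMM kernel, what are the performance trade-offs of a DRAM-resident hash table compared with the on-chip scratchpad?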

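Key Findings

SMASH V3 achieves a 9.4× speedup over SMASH V1 by combining dynamic load balancing, fragmented memory, and efficient DMA use.
SMASH V3 uses 95.9% of the available DRAM bandwidth, nearly saturating the memory subsystem even though input data and partial products share memory accesses.
Compared with V1, SMASH V3 raises instruction throughput by 155%, indicating higher computational efficiency and fewer idle cycles.
Because the producer-consumer model balances load effectively, SMASH V3 reaches close to 100% thread utilization, well above the unbalanced V1.
The in-memory hash table merges partial products immediately, which removes the need for intermediate storage and reduces memory footprint and access overhead.
Although hash collisions can create hotspots, the overall gains from bandwidth saturation and load balancing far outweigh the cost of collision resolution.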
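The immediate merging of partial products in a hash table, and the cost of collisions, can be illustrated with a small open-addressing table. Everything here is an assumption made for the sketch: the class name `MergeTable`, the hash that folds high-order bits into low-order bits, and linear probing are one plausible reading of the summary, not the scheme confirmed by the thesis. The code is single-threaded, so the comments mark where a hardware kernel would need atomic operations.

```python
EMPTY = -1  # sentinel for an unused slot; keys are assumed to be non-negative


class MergeTable:
    """Fixed-size open-addressing hash table that sums values with equal keys."""

    def __init__(self, log2_size):
        self.bits = log2_size
        self.mask = (1 << log2_size) - 1
        self.keys = [EMPTY] * (1 << log2_size)
        self.vals = [0.0] * (1 << log2_size)
        self.probes = 0  # counts extra slots visited because of collisions

    def _slot(self, key):
        # Hypothetical hash: fold the high-order bits into the low-order bits.
        return (key ^ (key >> self.bits)) & self.mask

    def add(self, key, value):
        slot = self._slot(key)
        for _ in range(len(self.keys)):
            if self.keys[slot] == EMPTY:
                self.keys[slot] = key  # a parallel kernel would claim this with compare-and-swap
            if self.keys[slot] == key:
                self.vals[slot] += value  # a parallel kernel would use an atomic add
                return
            slot = (slot + 1) & self.mask  # collision: try the next slot
            self.probes += 1
        raise RuntimeError("hash table is full")

    def items(self):
        return {k: v for k, v in zip(self.keys, self.vals) if k != EMPTY}


table = MergeTable(log2_size=3)  # 8 slots
# Each key encodes an output coordinate, e.g. row * num_cols + col (an assumption).
for key, partial in [(1, 4.0), (1, 12.0), (4, 15.0), (8, 2.0)]:
    table.add(key, partial)
print(table.items(), table.probes)
```

The two partial products for key 1 are summed in the same slot as soon as they arrive, while key 8 hashes to the same slot as key 1 and needs one extra probe. That extra probe is the collision cost the last finding weighs against the bandwidth and load-balancing gains.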

This review was generated by AI and checked by human editors.