QUICK REVIEW

[Paper Review] Instructions' Latencies Characterization for NVIDIA GPGPUs

Yehia Arafa, Abdel‐Hameed A. Badawy|arXiv (Cornell University)|May 21, 2019

Parallel Computing and Optimization Techniques | Cited by 3

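One-Sentence Summary

This paper presents a low-overhead, portable method for characterizing instruction latencies and memory-hierarchy access overheads on NVIDIA GPGPUs across five GPU architectures (Kepler through Turing); by measuring pipeline and memory behavior, it shows how CUDA compiler optimizations affect those latencies, giving developers and architects a basis for accurate hardware modeling and informed software optimization.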

ABSTRACT

The last decade has seen a shift in the computer systems industry where heterogeneous computing has become prevalent. Graphics Processing Units (GPUs) are now present in supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPUs) to boost the performance of compute-intensive applications. However, the percentage of undisclosed characteristics beyond what vendors provide is not small. In this paper, we introduce a very low overhead and portable analysis for exposing the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the micro-architecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform over the various latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five different generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects to have an accurate characterization of the latencies of these GPUs, which will help in modeling the hardware accurately. Also, software developers can perform informed optimizations to their applications.

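Motivation and Objectives

Expose micro-architectural latency characteristics of NVIDIA GPUs that go beyond what vendor documentation discloses.
Evaluate how CUDA compiler optimizations affect instruction and memory latencies across multiple GPU generations.
Provide a portable, low-overhead analysis technique for measuring pipeline and memory-access latencies in GPGPU workloads.
Enable accurate hardware modeling and informed software optimization by exposing previously undisclosed latency behavior.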

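Proposed Method

Develops a portable, low-overhead, kernel-based benchmarking framework for measuring instruction latencies in the GPU pipeline.
Designs microbenchmarks that isolate and measure access latency at each level of the memory hierarchy (registers, shared memory, L1, L2, and global memory).
Runs the measurements on seven high-end NVIDIA GPUs spanning five architectures: Kepler, Maxwell, Pascal, Volta, and Turing.
Uses timing-based analysis: latency values are inferred from the measured execution time of instruction sequences with known dependencies.
Correlates the observed latencies with compiler optimization levels to assess their effect on performance characteristics.
Validates the method across repeated runs and multiple GPU models to confirm its stability and portability.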
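The timing-based step above (inferring latency from the execution time of dependent instruction sequences) can be illustrated with a small post-processing sketch. This is not the paper's code. It assumes that each measurement is a pair of chain length and elapsed cycles, and that the elapsed time is a fixed timer overhead plus a constant per-instruction latency. The function name `fit_latency` and all numbers are hypothetical; the sample timings are synthetic, generated from an assumed model, not measured on any GPU.

```python
def fit_latency(samples):
    """Least-squares line through (chain_length, elapsed_cycles) pairs.

    Returns (cycles_per_instruction, fixed_overhead_cycles). The slope is the
    per-instruction latency; the intercept absorbs the cost of reading the timer.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two chain lengths")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("chain lengths must differ")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic timings from an ASSUMED model (not measurements): 4 cycles per
# dependent instruction plus 30 cycles of timer overhead.
ASSUMED_LATENCY = 4
ASSUMED_OVERHEAD = 30
samples = [(n, ASSUMED_OVERHEAD + ASSUMED_LATENCY * n) for n in (16, 32, 64, 128)]

latency, overhead = fit_latency(samples)
print(f"latency ~ {latency:.1f} cycles, overhead ~ {overhead:.1f} cycles")
```

Fitting several chain lengths rather than dividing a single measurement by the chain length keeps the fixed timer cost out of the per-instruction estimate. With real measurements the points would be noisy, so repeated runs per chain length would be advisable.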

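Experimental Results

Research Questions

RQ1: What are the actual instruction latencies of modern NVIDIA GPUs across architectures?
RQ2: How do memory-hierarchy access latencies vary across GPU generations and memory types?
RQ3: To what extent do CUDA compiler optimizations change the observed instruction and memory-access latencies?
RQ4: How consistent and portable is the proposed latency characterization method across GPU architectures?
RQ5: What insights from the latency measurements can improve hardware modeling and software optimization?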

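Key Findings

Instruction latencies differ significantly across NVIDIA GPU architectures, with newer generations showing markedly lower latency on key operations.
Access latency differs significantly between memory levels: global memory is the slowest and registers are the fastest.
CUDA compiler optimizations such as instruction scheduling and loop transformations significantly reduce the observed latencies, most noticeably in memory-bound kernels.
The proposed method achieves high accuracy at very low performance overhead, enabling reliable latency measurement on production GPUs.
Because latency values differ clearly between architectures, performance models must account for micro-architectural differences.
The results expose previously undocumented latency behavior, particularly on newer architectures such as Volta and Turing, which matters for accurate performance prediction.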
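The findings above separate memory levels by access latency. This review does not say how the paper's microbenchmarks isolate each level; a common technique for that purpose is pointer chasing, used here as an assumption and not as the authors' method. The sketch below only builds and walks the access pattern on the host: on a GPU, each step would be a timed dependent load, and the array size would be chosen to fit the memory level under test. The names `build_chase` and `walk` are hypothetical.

```python
def build_chase(num_elements, stride):
    """Return a list where chase[i] is the index to visit after i.

    Each element points `stride` slots ahead, wrapping around, so every load
    depends on the result of the previous one and cannot be overlapped.
    """
    if num_elements <= 0 or stride <= 0:
        raise ValueError("num_elements and stride must be positive")
    return [(i + stride) % num_elements for i in range(num_elements)]


def walk(chase, steps, start=0):
    """Follow the links for `steps` hops and return the visited indices."""
    idx = start
    visited = []
    for _ in range(steps):
        visited.append(idx)
        idx = chase[idx]
    return visited


# 8 elements with stride 3: the stride and the size share no common factor,
# so the walk touches every element before returning to the start.
chase = build_chase(8, 3)
print(walk(chase, 8))
```

The dependency between consecutive loads is the point of the design: it forces the accesses to serialize, so total time divided by the number of hops approximates the latency of a single access at that level.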

This review was generated by AI and reviewed by a human editor.