QUICK REVIEW

[Paper Digest] Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Zhe Jia, Marco Maggioni, Benjamin Staiger, Daniele P. Scarpazza | arXiv (Cornell University) | Apr 18, 2018

Parallel Computing and Optimization Techniques | References: 6 | Cited by: 220

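One-Sentence Summary

This paper uses microbenchmarks and instruction-set disassembly to dissect the low-level microarchitecture of the NVIDIA Volta GPU, including its memory hierarchy, instruction encoding, and performance characteristics. Its main contribution is a comprehensive, measurement-based analysis of Volta's behavior, in particular its instruction latency, throughput, and shared-memory access patterns, which enables performance optimizations beyond what standard CUDA compilation achieves.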

ABSTRACT

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on the novel NVIDIA GPUs, independent researchers have resorted to microbenchmarks-based dissection and discovery. This has led to a prolific line of publications that shed light on instruction encoding, and memory hierarchy's geometry and features at each level. Namely, research that describes the performance and behavior of the Kepler, Maxwell and Pascal architectures. In this technical report, we continue this line of research by presenting the microarchitectural details of the NVIDIA Volta architecture, discovered through microbenchmarks and instruction set disassembly. Additionally, we compare quantitatively our Volta findings against its predecessors, Kepler, Maxwell and Pascal.

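Research Motivation and Goals

Reverse-engineer the microarchitectural details of the NVIDIA Volta GPU through microbenchmarks, because NVIDIA does not disclose this low-level information.
Identify performance bottlenecks and optimization opportunities in the code that the NVCC compiler generates for Volta, especially in compute-intensive kernels.
Compare Volta's microarchitectural behavior with its predecessors (Kepler, Maxwell, Pascal) to highlight how the architecture evolved and what that means for performance.
Give developers and researchers a detailed, measurement-validated reference for reaching GPU performance beyond what standard CUDA code achieves.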

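Proposed Method

Systematically design microbenchmarks that probe specific architectural components, such as the L1/L2 caches, shared memory, the register file, and memory bandwidth.
Disassemble the native SASS code that NVCC generates from CUDA and PTX to reverse-engineer Volta's instruction encoding and analyze execution behavior precisely.
Collect performance measurements under a variety of workloads to characterize latency, throughput, and memory access patterns at the microarchitectural level.
Compare results across the Volta, Pascal, Maxwell, and Kepler architectures to highlight architectural changes and performance trends.
Use a minimal kernel example to show how microarchitectural insight can deliver performance beyond standard CUDA compilation.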
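
The cache-probing step above usually relies on pointer chasing: walk a cyclic chain of addresses over a growing footprint and watch for the point where accesses stop hitting. The sketch below is a host-side toy model, not GPU code and not the authors' benchmark. It simulates a fully associative LRU cache, and the 32 KiB capacity and 32 B line size are hypothetical defaults chosen only for illustration.

```python
from collections import OrderedDict


def miss_rate(footprint_bytes, cache_bytes=32 * 1024, line_bytes=32, passes=4):
    """Walk a cyclic pointer chain through a toy fully associative LRU cache.

    The chain visits every line of the footprint in order and wraps around.
    Returns the miss rate measured after one warm-up pass.
    All sizes are hypothetical model parameters, not measured Volta values.
    """
    n_lines = footprint_bytes // line_bytes
    capacity = cache_bytes // line_bytes
    cache = OrderedDict()  # line index -> None, ordered oldest to newest
    misses = accesses = 0
    for p in range(passes + 1):  # pass 0 is the warm-up pass
        for line in range(n_lines):
            hit = line in cache
            if hit:
                cache.move_to_end(line)  # refresh LRU position
            else:
                cache[line] = None
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
            if p > 0:
                accesses += 1
                misses += not hit
    return misses / accesses


def detect_capacity(sizes, **kw):
    """Return the largest footprint in `sizes` that never misses after warm-up."""
    fitting = [s for s in sizes if miss_rate(s, **kw) == 0.0]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    for kib in (8, 16, 32, 64, 128):
        print(f"{kib:4d} KiB footprint -> miss rate {miss_rate(kib * 1024):.2f}")
```

In this model the miss rate jumps from 0 to 1 as soon as the footprint exceeds the capacity, because a cyclic walk defeats LRU completely. On real hardware the same cliff shows up as a step in measured access latency, and repeating the sweep with different strides helps separate line size and associativity from capacity.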

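Experimental Results

Research Questions

RQ1: What are the true performance limits of the Volta GPU's memory hierarchy, including L1, L2, and shared memory?
RQ2: How do instruction latency and throughput on Volta differ from earlier NVIDIA GPU architectures?
RQ3: What microarchitectural properties do Volta's Tensor Cores have, and how do they affect single-precision floating-point performance?
RQ4: How much can a CUDA kernel's performance improve by exploiting microarchitectural insight beyond standard compiler optimization?
RQ5: How does the presence of NVLink on Volta boards affect peer-to-peer communication and memory bandwidth?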

Key Findings

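Each Volta streaming multiprocessor (SM) has 128 KB of combined L1 data cache and shared memory, of which up to 96 KB can be configured as shared memory, along with a 256 KB register file; the cache line size is 128 B.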
On the Volta V100, L2 cache bandwidth reaches 1.2 TB/s, with a 1024-entry TLB, a 4 KB page size, and access through a 1024-bit interface.
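Volta's Tensor Cores deliver up to 125 TFLOPS of peak mixed-precision performance (FP16 inputs with FP32 accumulation) on matrix-multiplication workloads, roughly 8x the V100's 15.7 TFLOPS FP32 peak.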
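Instruction-level microbenchmarks show that Volta's native floating-point instructions have a 4-cycle latency and a throughput of one instruction per cycle, with each SM supporting 32-way instruction-level parallelism.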
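Volta's shared memory is divided into 32 banks, each 4 bytes (one 32-bit word) wide, so successive 32-bit words map to successive banks. A full-warp access therefore covers 128 B (32 banks x 4 B), and conflict-free access patterns give the best bandwidth utilization.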
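Volta boards equipped with NVLink achieve up to 140 GB/s of peer-to-peer bandwidth between GPUs, far more than PCIe 3.0, whose x16 links peak at roughly 16 GB/s per direction.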
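
The shared-memory finding is easier to reason about with the bank mapping written out. The sketch below assumes the mapping documented for CUDA shared memory, 32 banks with successive 32-bit words in successive banks, and computes how many serialized passes a warp's access pattern would need. The function names are hypothetical helpers for illustration, not part of any CUDA API.

```python
from collections import defaultdict

NUM_BANKS = 32   # shared-memory banks per SM
BANK_WIDTH = 4   # bytes per bank word (one 32-bit word)


def conflict_degree(byte_addresses):
    """Return the largest number of distinct words that map to a single bank.

    1 means the access is conflict-free; k means roughly k serialized passes.
    Lanes reading the same word are served by a broadcast, so they do not conflict.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_warp(stride_words, lanes=32):
    """Byte addresses for a warp in which lane i reads 32-bit word i * stride_words."""
    return [lane * stride_words * BANK_WIDTH for lane in range(lanes)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        degree = conflict_degree(strided_warp(stride))
        print(f"stride {stride:2d} words -> conflict degree {degree}")
```

A stride of 32 words sends every lane to the same bank, which is why padding a 32-column shared-memory tile to 33 columns is a common way to remove conflicts on column-wise access.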

This digest was generated by AI and reviewed by a human editor.