QUICK REVIEW

[Paper Digest] Exploring Modern GPU Memory System Design Challenges through Accurate Modeling

Mahmoud Khairy, Akshay Jain|arXiv (Cornell University)|Oct 16, 2018

Parallel Computing and Optimization Techniques | References: 29 | Cited by: 25

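One-Sentence Summary

This paper presents a substantially enhanced version of the GPGPU-Sim GPU simulator with a highly detailed memory system model of the NVIDIA Volta GPU; measured against real hardware, it reduces error in memory system counters by as much as 66X and execution cycle error by 2.5X. The improved model captures advanced features such as out-of-order memory scheduling, an adaptive L1 cache, and sector-based cache organization, and it shows that the earlier simulator overstates L1 cache contention while understating the benefit of sophisticated memory scheduling.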

ABSTRACT

This paper explores the impact of simulator accuracy on architecture design decisions in the general-purpose graphics processing unit (GPGPU) space. We perform a detailed, quantitative analysis of the most popular publicly available GPU simulator, GPGPU-Sim, against our enhanced version of the simulator, updated to model the memory system of modern GPUs in more detail. Our enhanced GPU model is able to describe the NVIDIA Volta architecture in sufficient detail to reduce error in memory system event counters by as much as 66X. The reduced error in the memory system further reduces execution time error versus real hardware by 2.5X. To demonstrate the accuracy of our enhanced model against a real machine, we perform a counter-by-counter validation against an NVIDIA TITAN V Volta GPU, demonstrating the relative accuracy of the new simulator versus the publicly available model. We go on to demonstrate that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling, while overstating the impact of more heavily researched areas like L1 cache bypassing. Our results demonstrate that it is important for the academic community to enhance the level of detail in architecture simulators as system complexity continues to grow. As part of this detailed correlation and modeling effort, we developed a new Correlator toolset that includes a consolidation of applications from a variety of popular GPGPU benchmark suites, designed to run in reasonable simulation times. The Correlator also includes a database of hardware profiling results for all these applications on NVIDIA cards ranging from Fermi to Volta and a toolchain that enables users to gather correlation statistics and create detailed counter-by-counter hardware correlation plots with minimal effort.

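Motivation and Objectives

Close the widening gap between simulation accuracy and real hardware behavior in GPU architecture research.
Identify and correct major inaccuracies in the memory system model of the widely used GPGPU-Sim simulator.
Show how simulation inaccuracy distorts architectural design decisions, for example by overstating the impact of the L1 cache and understating the benefit of out-of-order memory scheduling.
Establish an open-source, reproducible validation framework for future GPU simulator development.
Provide academic research with a high-fidelity, open-source model of the NVIDIA Volta GPU memory system.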

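Proposed Method

Redesign the GPGPU-Sim memory system module by module, based on reverse engineering and hardware microbenchmarks, so that it accurately reflects the detailed behavior of the Volta GPU.
Fold public documentation, earlier microbenchmark studies, and newly discovered details of cache sectoring, the coalescing mechanism, and write policies into the model.
Implement an adaptive L1 cache configuration policy and precise L2 cache replacement logic that mirror the hardware's behavior.
Validate the new model counter by counter against a real NVIDIA TITAN V GPU using a diverse set of GPGPU workloads.
Develop the Correlator toolset, which automates correlation statistics, plotting, and benchmark consolidation across multiple GPU generations (Fermi through Volta).
Use the enhanced simulator for case studies of architectural design trade-offs, comparing simulated results with real hardware under different configurations.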
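
To make the idea of a sector-based cache concrete, the following is a minimal Python sketch, not the simulator's actual implementation. It assumes 128-byte lines split into 32-byte sectors, a fully associative organization, and LRU replacement; the class name SectoredCache and all constants are hypothetical and chosen only for illustration. The point it shows is that a request can match a resident line's tag and still miss on the sector, in which case only that sector is fetched.

```python
from collections import OrderedDict

LINE_BYTES = 128    # assumed cache line size (illustrative)
SECTOR_BYTES = 32   # assumed sector size, giving 4 sectors per line


class SectoredCache:
    """Toy fully associative sectored cache with LRU replacement."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        # Maps line tag -> set of valid sector indices, ordered by recency.
        self.lines = OrderedDict()
        self.bytes_fetched = 0

    def access(self, addr):
        tag = addr // LINE_BYTES
        sector = (addr % LINE_BYTES) // SECTOR_BYTES
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark line as most recently used
            if sector in self.lines[tag]:
                return "hit"
            # Tag matches but the sector is absent: fetch only that sector.
            self.lines[tag].add(sector)
            self.bytes_fetched += SECTOR_BYTES
            return "sector_miss"
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = {sector}
        self.bytes_fetched += SECTOR_BYTES
        return "line_miss"
```

In this sketch a miss moves 32 bytes rather than a full 128-byte line, which is why a model that ignores sectoring can misreport both miss counts and memory traffic.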

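Experimental Results

Research Questions

RQ1: On key performance counters, how accurate is GPGPU-Sim's memory system model relative to real hardware when simulating an NVIDIA Volta GPU?
RQ2: Which specific architectural features do existing open-source simulators misrepresent or oversimplify when modeling the Volta GPU memory system?
RQ3: How does memory system simulation inaccuracy affect the evaluation of architectural design proposals such as L1 cache bypassing or out-of-order memory scheduling?
RQ4: To what extent does improved simulation accuracy reduce execution cycle error relative to real hardware?
RQ5: Can detailed microbenchmarking and reverse engineering uncover new, undocumented memory system behavior?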
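
RQ3 hinges on what out-of-order memory scheduling buys over in-order service. The Python sketch below compares first-come-first-served (FCFS) with a first-ready FCFS (FR-FCFS) policy on a single DRAM bank. FR-FCFS is used here as one common out-of-order policy; whether it matches the exact scheduler configuration evaluated in the paper is an assumption. The function names are hypothetical, and the model assumes every request is already queued and counts only row-buffer hits.

```python
def count_row_hits_fcfs(rows):
    """Serve requests strictly in arrival order; count row-buffer hits."""
    open_row = None
    hits = 0
    for row in rows:
        if row == open_row:
            hits += 1
        open_row = row
    return hits


def count_row_hits_frfcfs(rows):
    """Serve the oldest request to the open row first, else the oldest request."""
    pending = list(rows)
    open_row = None
    hits = 0
    while pending:
        # Index of the oldest pending request that targets the open row, or 0.
        idx = next((i for i, r in enumerate(pending) if r == open_row), 0)
        row = pending.pop(idx)
        if row == open_row:
            hits += 1
        open_row = row
    return hits
```

For an alternating request stream such as rows 1, 2, 1, 2, 1, 2, the in-order policy never hits the open row, while the reordering policy groups requests by row. A simulator that models only in-order service cannot show this difference.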

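Key Findings

When simulating the NVIDIA TITAN V, the enhanced GPGPU-Sim model reduces the mean absolute error of memory system counters by as much as 66X relative to the original GPGPU-Sim 3.x model.
The new model reduces execution cycle error by 2.5X relative to the old model, a marked gain in simulation fidelity.
The new model reaches 96% correlation on execution cycles, versus only 71% for the original model.
The original GPGPU-Sim model wrongly identifies the L1 cache as a performance bottleneck, overstating its effect on performance, and understates the benefit of out-of-order memory scheduling.
The study uncovers previously undocumented features, including Volta's sector-based L1 and L2 caches, an adaptive L1 configuration mechanism, and a bandwidth-saving L2 write policy.
The Correlator toolset enables fast, automated counter-by-counter correlation between simulation and real hardware, which supports future model validation.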
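
The error and correlation figures above can be read through the following Python sketch. It is not the Correlator toolset's code: the function names are hypothetical, and the choice of mean absolute relative error and the Pearson coefficient is an assumption about how such statistics are typically computed. An "N-fold" reduction is taken to mean the old error divided by the new error.

```python
import math


def mean_abs_relative_error(sim, hw):
    """Average of |sim - hw| / hw over paired counter values (hw must be nonzero)."""
    return sum(abs(s - h) / h for s, h in zip(sim, hw)) / len(hw)


def pearson(sim, hw):
    """Pearson correlation coefficient between simulated and hardware values."""
    n = len(hw)
    mean_s = sum(sim) / n
    mean_h = sum(hw) / n
    cov = sum((s - mean_s) * (h - mean_h) for s, h in zip(sim, hw))
    var_s = sum((s - mean_s) ** 2 for s in sim)
    var_h = sum((h - mean_h) ** 2 for h in hw)
    return cov / math.sqrt(var_s * var_h)


def error_reduction_factor(old_error, new_error):
    """How many times smaller the new error is than the old one."""
    return old_error / new_error
```

Each list would hold one value per benchmark kernel for a single counter, for example simulated versus measured cycle counts. Any values passed in when trying the sketch are placeholders, not results from the paper.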


This digest was generated by AI and reviewed by a human editor.