QUICK REVIEW

[Paper Review] Techniques for Shared Resource Management in Systems with Throughput Processors

Rachata Ausavarungnirun|arXiv (Cornell University)|Jan 1, 2017
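Parallel Computing and Optimization Techniques | References: 278 | Cited by: 6
One-Sentence Summary

This dissertation proposes GPU-aware memory management techniques that mitigate inter- and intra-application interference in systems with throughput processors. It introduces MeDiC (warp-aware cache management), SMS (a staged memory scheduler for CPU-GPU memory scheduling), MASK (TLB-aware memory management), and Mosaic (a hardware-software cooperative design for large-page allocation). Together, these techniques improve performance, fairness, and efficiency for multi-application GPU workloads.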

ABSTRACT

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. This dissertation identifies and eliminates performance bottlenecks caused by major sources of interference throughout the memory hierarchy. Specifically, we provide an in-depth analysis of inter- and intra-application as well as inter-address-space interference that significantly degrade the performance and efficiency of GPU-based systems. To minimize such interference, we introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications’ characteristics. We introduce mechanisms to dynamically analyze different applications’ characteristics and propose four major changes throughout the memory hierarchy.

First, we introduce Memory Divergence Correction (MeDiC), a cache management mechanism that mitigates intra-application interference in GPGPU applications by allowing the shared L2 cache and the memory controller to be aware of the GPU’s warp-level memory divergence characteristics. MeDiC uses this warp-level memory divergence information to give more cache space and more memory bandwidth to warps that benefit most from utilizing such resources. Our evaluations show that MeDiC significantly outperforms multiple state-of-the-art caching policies proposed for GPUs.

Second, we introduce the Staged Memory Scheduler (SMS), an application-aware CPU-GPU memory request scheduler that mitigates inter-application interference in heterogeneous CPU-GPU systems. SMS creates a fundamentally new approach to memory controller design that decouples the memory controller into three significantly simpler structures, each of which has a separate task. These structures operate together to greatly improve both system performance and fairness. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness. As a result, the last stage is simple logic that deals only with the low-level DRAM commands and timing. SMS is also configurable: it allows the system software to trade off between the quality of service provided to the CPU versus GPU applications. Our evaluations show that SMS not only reduces inter-application interference caused by the GPU, thereby improving heterogeneous system performance, but also provides better scalability and power efficiency compared to multiple state-of-the-art memory schedulers.

Third, we redesign the GPU memory management unit to efficiently handle new problems caused by the massive address translation parallelism present in GPU computation units in multi-GPU-application environments. Running multiple GPGPU applications concurrently induces significant inter-core thrashing on the shared address translation/protection units (e.g., the shared Translation Lookaside Buffer (TLB)), a new phenomenon that we call inter-address-space interference. To reduce this interference, we introduce Multi Address Space Concurrent Kernels (MASK). MASK introduces TLB-awareness throughout the GPU memory hierarchy and introduces TLB- and cache-bypassing techniques to increase the effectiveness of a shared TLB.

Finally, we introduce Mosaic, a hardware-software cooperative technique that further increases the effectiveness of the TLB by modifying the memory allocation policy in the system software. Mosaic introduces a high-throughput method to support large pages in multi-GPU-application environments. The key idea is to ensure that memory allocation preserves address space contiguity, allowing pages to be coalesced without any data movement. Our evaluations show that the MASK-Mosaic combination provides a simple mechanism that eliminates the performance overhead of address translation in GPUs without significant changes to GPU hardware, thereby greatly improving GPU system performance.

The key conclusion of this dissertation is that a combination of GPU-aware cache and memory management techniques can effectively mitigate the memory interference on current and future GPU-based systems as well as other types of throughput processors.
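The three SMS stages described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Request` fields, the function names, and the plain round-robin policy in stage 2 are hypothetical choices made for brevity, not the policy or structures from the dissertation, and DRAM timing is ignored entirely.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source: str  # hypothetical source id, e.g. "cpu0" or "gpu"
    bank: int
    row: int


def form_batches(requests):
    """Stage 1: per source, group consecutive requests to the same (bank, row)."""
    batches = defaultdict(deque)
    for req in requests:
        queue = batches[req.source]
        if queue and (queue[-1][-1].bank, queue[-1][-1].row) == (req.bank, req.row):
            queue[-1].append(req)   # same row as the previous request: extend batch
        else:
            queue.append([req])     # different row: start a new batch
    return batches


def schedule_batches(batches, order):
    """Stage 2: pick whole batches across sources (round-robin is an assumption)."""
    scheduled = []
    while any(batches[s] for s in order):
        for s in order:
            if batches[s]:
                scheduled.append(batches[s].popleft())
    return scheduled


def issue_commands(scheduled):
    """Stage 3: emit low-level DRAM commands, tracking the open row per bank."""
    open_row = {}
    commands = []
    for batch in scheduled:
        for req in batch:
            if open_row.get(req.bank) != req.row:
                if req.bank in open_row:
                    commands.append(("PRE", req.bank))      # close the old row
                commands.append(("ACT", req.bank, req.row))  # open the new row
                open_row[req.bank] = req.row
            commands.append(("RD", req.bank, req.row, req.source))
    return commands


requests = [
    Request("gpu", 0, 5), Request("gpu", 0, 5), Request("cpu0", 0, 9),
    Request("gpu", 0, 5), Request("cpu0", 1, 2),
]
batches = form_batches(requests)
commands = issue_commands(schedule_batches(batches, ["cpu0", "gpu"]))
for command in commands:
    print(command)
```

Because stage 1 keeps same-row requests together, the three GPU requests in this example are served with a single row activation, and stage 3 never has to reason about which application a request belongs to.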

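Motivation and Objectives

  • Identify and eliminate the key performance bottlenecks that inter- and intra-application interference causes in the GPU memory hierarchy.
  • Design memory management mechanisms that account for the characteristics of both CPU and GPU applications.
  • Improve overall system performance, fairness, and energy efficiency in heterogeneous CPU-GPU systems that share memory resources.
  • Address new forms of interference, such as the inter-address-space interference caused by concurrent GPGPU applications.
  • Enable efficient, high-throughput memory allocation in multi-GPU-application environments without significant hardware changes.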

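Proposed Methods

  • MeDiC: a cache management mechanism that uses warp-level memory divergence information to give more cache space and memory bandwidth to the warps that benefit most.
  • Staged Memory Scheduler (SMS): a three-stage memory controller that decouples request batching, inter-application scheduling, and low-level DRAM command generation.
  • MASK: a TLB-aware GPU memory management unit that uses TLB- and cache-bypassing techniques to reduce contention among concurrent applications for the shared TLB and caches.
  • Mosaic: a hardware-software cooperative technique that preserves address space contiguity so that pages can be coalesced into large pages without moving data.
  • MASK and Mosaic combined eliminate address translation overhead in multi-application GPU workloads with minimal hardware changes.
  • Dynamic analysis of application characteristics guides run-time resource allocation decisions throughout the memory hierarchy.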
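As a rough illustration of the MeDiC idea in the first bullet, the sketch below classifies a warp by its sampled L2 hit ratio and derives two decisions from that class. The category names, the 0.7 threshold, and both function names are assumptions made for this example; the summary above does not specify them.

```python
def classify_warp(hits, accesses, mostly=0.7):
    """Return a warp type from a sampled L2 hit ratio (threshold is an assumption)."""
    if accesses == 0:
        return "balanced"  # no samples yet: make no strong claim
    ratio = hits / accesses
    if ratio == 1.0:
        return "all-hit"
    if ratio >= mostly:
        return "mostly-hit"
    if ratio == 0.0:
        return "all-miss"
    if ratio <= 1 - mostly:
        return "mostly-miss"
    return "balanced"


def l2_policy(warp_type):
    """Map a warp type to (bypass_l2, high_priority) decisions."""
    # Warps that rarely hit gain little from cache space, so they bypass the L2.
    bypass_l2 = warp_type in ("mostly-miss", "all-miss")
    # Warps that mostly hit are stalled by their few misses, so those go first.
    high_priority = warp_type in ("all-hit", "mostly-hit")
    return bypass_l2, high_priority


for hits in (10, 9, 5, 1, 0):
    warp_type = classify_warp(hits, 10)
    print(hits, warp_type, l2_policy(warp_type))
```

The point of the sketch is the asymmetry: a warp stalls until its slowest request returns, so cache space and bandwidth help most when they turn a warp's last few misses into hits.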

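Experimental Results

Research Questions

  • RQ1: How can GPU cache mechanisms minimize the intra-application interference caused by warp-level memory divergence?
  • RQ2: How can a memory scheduler be designed to be scalable and fair while reducing inter-application interference between CPU and GPU workloads?
  • RQ3: How can the inter-address-space interference that concurrent GPGPU applications cause in shared TLB and cache structures be mitigated?
  • RQ4: What role does virtual memory contiguity play in efficient large-page management for multi-GPU-application systems?
  • RQ5: How can software and hardware cooperate to eliminate address translation overhead in the GPU memory hierarchy?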
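To make the interference behind RQ3 concrete, here is a toy model of a shared LRU TLB used by two address spaces. Everything in it is an assumption chosen for illustration (the `SharedTLB` class, the 8-entry size, the access pattern, and the `may_fill` flag standing in for a bypass decision); it is not the MASK mechanism itself.

```python
from collections import OrderedDict


class SharedTLB:
    """A tiny LRU TLB keyed by (address-space id, virtual page number)."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()

    def access(self, asid, vpn, may_fill=True):
        key = (asid, vpn)
        if key in self.map:
            self.map.move_to_end(key)         # refresh LRU position on a hit
            return True
        if may_fill:                          # a bypassed miss leaves the TLB untouched
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)  # evict the least recently used entry
            self.map[key] = True
        return False


def hit_rate_of_app_a(bypass_b, rounds=50, entries=8):
    """App A reuses 4 pages; app B streams through pages it never revisits."""
    tlb = SharedTLB(entries)
    hits = total = 0
    next_b = 0
    for _ in range(rounds):
        for vpn in range(4):
            hits += tlb.access("A", vpn)
            total += 1
        for _ in range(8):
            tlb.access("B", next_b, may_fill=not bypass_b)
            next_b += 1
    return hits / total


print(hit_rate_of_app_a(bypass_b=False))  # 0.0: B's stream evicts A every round
print(hit_rate_of_app_a(bypass_b=True))   # 0.98: A misses only in the first round
```

In this model, letting the streaming application bypass the shared TLB costs it nothing, since it never reuses a translation, while the other application keeps its whole working set resident.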

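Key Findings

  • MeDiC outperforms multiple state-of-the-art GPU caching policies by dynamically allocating cache space and memory bandwidth according to warp-level memory divergence characteristics.
  • SMS reduces inter-application interference, improves system performance and fairness, and offers better scalability and power efficiency than existing memory schedulers.
  • MASK uses TLB-awareness and bypassing to reduce contention among concurrent applications for the shared TLB and caches in multi-application GPU environments.
  • The MASK-Mosaic combination supports large pages efficiently in multi-application GPU workloads without data movement, and removes most of the address translation overhead with only minimal hardware changes.
  • Taken together, the techniques improve overall GPU system performance and efficiency by mitigating the different kinds of interference in the memory hierarchy.
  • The evaluations indicate that the proposed mechanisms address bottlenecks emerging in modern GPU systems, particularly under concurrent GPGPU workloads.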
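The "no data movement" property of the large-page support can be illustrated with a minimal allocator sketch. It assumes 2 MB large pages built from 4 KB base pages (512 per large page), and the `ContiguityAllocator` class and its methods are hypothetical names for this example, not Mosaic's actual interface.

```python
PAGES_PER_LARGE = 512  # assumption: 2 MB large page / 4 KB base page


class ContiguityAllocator:
    """Give each (app, virtual large page) its own physical large-page frame."""

    def __init__(self, num_large_frames):
        self.free_frames = list(range(num_large_frames))
        self.frame_of = {}  # (app, virtual large page) -> physical large frame
        self.mapped = {}    # (app, virtual large page) -> set of mapped offsets

    def map_page(self, app, vpn):
        """Map one base page; return its physical page number."""
        vlp, offset = divmod(vpn, PAGES_PER_LARGE)
        key = (app, vlp)
        if key not in self.frame_of:
            # Raises IndexError when no free large frame is left.
            self.frame_of[key] = self.free_frames.pop(0)
            self.mapped[key] = set()
        self.mapped[key].add(offset)
        # The page keeps the same offset inside the frame as in the virtual
        # large page, so virtual and physical layouts stay aligned.
        return self.frame_of[key] * PAGES_PER_LARGE + offset

    def can_coalesce(self, app, vlp):
        """True once every base page of the virtual large page is mapped."""
        return len(self.mapped.get((app, vlp), ())) == PAGES_PER_LARGE


alloc = ContiguityAllocator(num_large_frames=4)
for vpn in range(PAGES_PER_LARGE):
    alloc.map_page("app0", vpn)
print(alloc.can_coalesce("app0", 0))  # True: only the page table needs updating
print(alloc.map_page("app1", 3))      # 515: app1 gets its own frame
```

Since a frame never mixes base pages from different applications or different virtual large pages, promoting it to a large page in this sketch is a metadata change only; no page has to be copied to make the region contiguous.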

This review was generated by AI and reviewed by human editors.