QUICK REVIEW

[Paper Review] Techniques for Shared Resource Management in Systems with Throughput Processors

Rachata Ausavarungnirun|arXiv (Cornell University)|2017. 01. 01.

Parallel Computing and Optimization Techniques | References: 278 | Citations: 6

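One-Line Summary

This dissertation proposes GPU-aware memory management techniques that mitigate inter- and intra-application interference in systems with throughput processors. It introduces MeDiC for warp-level cache management, SMS for CPU-GPU memory scheduling, MASK for TLB-aware memory management, and Mosaic, a software-hardware co-design for large page allocation, to improve performance, fairness, and efficiency for multi-application GPU workloads.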

ABSTRACT

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime example of throughput processors that can deliver high performance for applications ranging from typical graphics applications to general-purpose data parallel (GPGPU) applications. However, this success has been accompanied by new performance bottlenecks throughout the memory hierarchy of GPU-based systems. This dissertation identifies and eliminates performance bottlenecks caused by major sources of interference throughout the memory hierarchy. Specifically, we provide an in-depth analysis of inter- and intra-application as well as inter-address-space interference that significantly degrade the performance and efficiency of GPU-based systems. To minimize such interference, we introduce changes to the memory hierarchy for systems with GPUs that allow the memory hierarchy to be aware of both CPU and GPU applications’ characteristics. We introduce mechanisms to dynamically analyze different applications’ characteristics and propose four major changes throughout the memory hierarchy.
First, we introduce Memory Divergence Correction (MeDiC), a cache management mechanism that mitigates intra-application interference in GPGPU applications by allowing the shared L2 cache and the memory controller to be aware of the GPU’s warp-level memory divergence characteristics. MeDiC uses this warp-level memory divergence information to give more cache space and more memory bandwidth to warps that benefit most from utilizing such resources. Our evaluations show that MeDiC significantly outperforms multiple state-of-the-art caching policies proposed for GPUs.
Second, we introduce the Staged Memory Scheduler (SMS), an application-aware CPU-GPU memory request scheduler that mitigates inter-application interference in heterogeneous CPU-GPU systems. SMS creates a fundamentally new approach to memory controller design that decouples the memory controller into three significantly simpler structures, each of which has a separate task. These structures operate together to greatly improve both system performance and fairness. Our three-stage memory controller first groups requests based on row-buffer locality. This grouping allows the second stage to focus on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness. As a result, the last stage is simple logic that deals only with the low-level DRAM commands and timing. SMS is also configurable: it allows the system software to trade off between the quality of service provided to the CPU versus GPU applications. Our evaluations show that SMS not only reduces inter-application interference caused by the GPU, thereby improving heterogeneous system performance, but also provides better scalability and power efficiency compared to multiple state-of-the-art memory schedulers.
Third, we redesign the GPU memory management unit to efficiently handle new problems caused by the massive address translation parallelism present in GPU computation units in multi-GPU-application environments. Running multiple GPGPU applications concurrently induces significant inter-core thrashing on the shared address translation/protection units, e.g., the shared Translation Lookaside Buffer (TLB), a new phenomenon that we call inter-address-space interference. To reduce this interference, we introduce Multi Address Space Concurrent Kernels (MASK). MASK introduces TLB-awareness throughout the GPU memory hierarchy and introduces TLB- and cache-bypassing techniques to increase the effectiveness of a shared TLB.
Finally, we introduce Mosaic, a hardware-software cooperative technique that further increases the effectiveness of the TLB by modifying the memory allocation policy in the system software. Mosaic introduces a high-throughput method to support large pages in multi-GPU-application environments. The key idea is to ensure that memory allocation preserves address space contiguity, which allows pages to be coalesced without any data movement. Our evaluations show that the MASK-Mosaic combination provides a simple mechanism that eliminates the performance overhead of address translation in GPUs without significant changes to GPU hardware, thereby greatly improving GPU system performance.
The key conclusion of this dissertation is that a combination of GPU-aware cache and memory management techniques can effectively mitigate the memory interference on current and future GPU-based systems as well as other types of throughput processors.

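Research Motivation and Goals

Identify and eliminate the performance bottlenecks caused by inter- and intra-application interference in the GPU memory hierarchy.
Design memory management mechanisms that are aware of the characteristics of both CPU and GPU applications.
Improve overall system performance, fairness, and power efficiency in heterogeneous CPU-GPU systems that share memory resources.
Address inter-address-space interference, a new interference phenomenon caused by concurrently running GPGPU applications.
Enable efficient, high-throughput memory allocation in multi-GPU-application environments without significant hardware changes.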

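Proposed Method

Introduces MeDiC, a cache management mechanism that uses warp-level memory divergence information to dynamically give more cache space and memory bandwidth to the warps that benefit most.
Proposes the Staged Memory Scheduler (SMS), a three-stage memory controller that decouples memory request grouping, inter-application scheduling, and low-level DRAM command generation.
Designs MASK, a TLB-aware GPU memory management unit that uses TLB- and cache-bypassing techniques to reduce inter-core contention on the shared TLB and cache units in multi-application environments.
Develops Mosaic, a hardware-software co-design that preserves the contiguity of the virtual address space so that large pages can be coalesced efficiently without data movement.
Combines MASK and Mosaic to eliminate address translation overhead in multi-application GPU workloads with minimal hardware changes.
Applies dynamic analysis of application characteristics to guide runtime resource allocation decisions throughout the memory hierarchy.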
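
The three-stage split that SMS uses can be sketched as follows. This is a minimal illustration based only on the description in the abstract: the `Request` tuple, the function names, the round-robin policy in stage 2, and the simplified ACT/RD/PRE command set are all assumptions made for the example, not the dissertation's actual design or policies.

```python
from collections import deque, namedtuple
from itertools import groupby

# Hypothetical request record: which application issued it, and its DRAM bank and row.
Request = namedtuple("Request", ["source", "bank", "row"])


def form_batches(requests):
    """Stage 1: per source, group consecutive requests to the same bank and row."""
    per_source = {}
    for req in requests:
        per_source.setdefault(req.source, []).append(req)
    return {
        source: deque(
            list(group)
            for _, group in groupby(reqs, key=lambda r: (r.bank, r.row))
        )
        for source, reqs in per_source.items()
    }


def schedule_batches(queues):
    """Stage 2: choose whole batches across sources (round-robin is assumed here)."""
    order = []
    while any(queues.values()):
        for source in list(queues):
            if queues[source]:
                order.append(queues[source].popleft())
    return order


def issue_commands(batches):
    """Stage 3: emit low-level DRAM commands, tracking the open row of each bank."""
    open_row = {}
    commands = []
    for batch in batches:
        for req in batch:
            if open_row.get(req.bank) != req.row:
                if req.bank in open_row:
                    commands.append(("PRE", req.bank))  # close the old row
                commands.append(("ACT", req.bank, req.row))  # open the new row
                open_row[req.bank] = req.row
            commands.append(("RD", req.bank, req.row))
    return commands


requests = [
    Request("gpu", 0, 1), Request("gpu", 0, 1), Request("cpu", 0, 2),
    Request("gpu", 0, 3), Request("cpu", 0, 2),
]
batches = schedule_batches(form_batches(requests))
commands = issue_commands(batches)
```

Because stages 1 and 2 carry all of the locality and fairness decisions, stage 3 in this sketch only needs to know which row is open in each bank, which mirrors the abstract's point that the last stage is simple logic.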

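Experimental Results

Research Questions

RQ1: How can intra-application interference in GPU caching caused by warp-level memory divergence be minimized?
RQ2: What scalable and fair memory scheduling approach reduces inter-application interference between CPU and GPU workloads?
RQ3: How can inter-address-space interference, caused by concurrently running GPGPU applications on shared TLB and cache structures, be mitigated?
RQ4: What role does virtual memory contiguity play in implementing efficient large page management in multi-GPU-application systems?
RQ5: How can software and hardware cooperate to eliminate address translation overhead in the GPU memory hierarchy?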

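Key Results

MeDiC dynamically allocates cache space and memory bandwidth based on warp-level memory divergence characteristics, and outperforms multiple state-of-the-art GPU caching policies.
SMS reduces inter-application interference, improves system performance and fairness, and achieves better scalability and power efficiency than existing memory schedulers.
MASK significantly reduces inter-core contention on the shared TLB and cache units in multi-application GPU environments through TLB-awareness and bypassing techniques.
The MASK-Mosaic combination enables efficient large page support for multi-application GPU workloads without data movement, and eliminates most of the address translation overhead with minimal hardware changes.
Together, the techniques mitigate interference throughout the memory hierarchy, improving overall GPU system performance and efficiency.
The evaluations show that the proposed techniques effectively address the new performance bottlenecks in modern GPU-based systems, and perform especially well under concurrent GPGPU workloads.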
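
The data-movement-free coalescing attributed to Mosaic can be illustrated with a small allocator sketch. It assumes that each large-page frame holds base pages from only one application, and that each base page is placed at the same offset it has within its virtual large page. The class name, its methods, and the 512 base pages per large page (2 MB over 4 KB) are assumptions for illustration, not the dissertation's implementation.

```python
BASE_PAGES_PER_LARGE_PAGE = 512  # assumed ratio: 2 MB large page / 4 KB base page


class ContiguityPreservingAllocator:
    """Hypothetical allocator that keeps virtual and physical layout aligned."""

    def __init__(self, num_large_frames):
        self.free_frames = list(range(num_large_frames))
        self.frame_of = {}  # (app, virtual large page number) -> physical frame
        self.mapped = {}    # (app, virtual large page number) -> mapped offsets

    def map_base_page(self, app, vpn):
        """Map one virtual base page and return its physical base page number."""
        vlpn, offset = divmod(vpn, BASE_PAGES_PER_LARGE_PAGE)
        key = (app, vlpn)
        if key not in self.frame_of:
            # Reserve a whole large frame for this application's virtual large page.
            # Raises IndexError if no frame is free; a real system would need a fallback.
            self.frame_of[key] = self.free_frames.pop(0)
            self.mapped[key] = set()
        self.mapped[key].add(offset)
        # Same offset inside the frame as inside the virtual large page.
        return self.frame_of[key] * BASE_PAGES_PER_LARGE_PAGE + offset

    def can_coalesce(self, app, vlpn):
        """A fully populated frame can be treated as one large page without copying."""
        return len(self.mapped.get((app, vlpn), ())) == BASE_PAGES_PER_LARGE_PAGE


alloc = ContiguityPreservingAllocator(num_large_frames=4)
ppns_a = [alloc.map_base_page("A", vpn) for vpn in range(BASE_PAGES_PER_LARGE_PAGE)]
ppn_b = alloc.map_base_page("B", BASE_PAGES_PER_LARGE_PAGE)
```

In this sketch, coalescing is only a metadata check: because the base pages already sit contiguously and in order inside one frame, no data has to be moved when the frame is promoted to a large page.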

This review was generated by AI and reviewed by a human editor.