QUICK REVIEW

[Paper Review] DynaSOAr: A Parallel Memory Allocator for Object-oriented Programming on GPUs with Efficient Memory Access

Matthias Springer, Hidehiko Masuhara | arXiv (Cornell University) | October 28, 2018

Parallel Computing and Optimization Techniques | Citations: 3

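One-Line Summary

DynaSOAr is a CUDA-based, lock-free dynamic memory allocator for object-oriented programming on GPUs that optimizes both memory allocation and memory access patterns. It combines a hierarchical-bitmap allocator, a Structure of Arrays (SOA) data layout, and a parallel do-all operation. Compared with state-of-the-art GPU allocators, it speeds up application code by up to 3x, reduces memory fragmentation, and allows up to 2x larger problem sizes within the same memory budget.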

ABSTRACT

Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation and memory access: There are a few libraries that support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses. We discovered a broad class of object-oriented programs with many important real-world applications that can be implemented efficiently on massively parallel SIMD accelerators. We call this class Single-Method Multiple-Objects (SMMO), because parallelism is expressed by running a method on all objects of a type. To make fast GPU programming available to average programmers, we developed DynaSOAr, a CUDA framework for SMMO applications. DynaSOAr consists of (1) a fully-parallel, lock-free, dynamic memory allocator, (2) a data layout DSL and (3) an efficient, parallel do-all operation. DynaSOAr achieves performance superior to state-of-the-art GPU memory allocators by controlling both memory allocation and memory access. DynaSOAr improves the usage of allocated memory with a Structure of Arrays data layout and achieves low memory fragmentation through efficient management of free and allocated memory blocks with lock-free, hierarchical bitmaps. Contrary to other allocators, our design is heavily based on atomic operations, trading raw (de)allocation performance for better overall application performance. In our benchmarks, DynaSOAr achieves a speedup of application code of up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.

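Motivation and Goals

Address the performance penalty of dynamic memory allocation in object-oriented GPU programming, with a focus on data-parallel workloads.
Enable efficient, scalable, lock-free memory management on GPUs, a SIMD architecture, for applications with dynamic sets of objects.
Improve memory bandwidth utilization by coalescing memory accesses through a Structure of Arrays (SOA) layout, managed with hierarchical bitmaps.
Support the Single-Method Multiple-Objects (SMMO) programming model, in which a single method is applied to all instances of a class in parallel.
Reduce memory fragmentation and improve heap utilization so that larger problem sizes fit within a fixed memory limit.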

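Proposed Method

DynaSOAr tracks free and allocated memory blocks with hierarchical bitmaps that are updated through lock-free atomic operations. This keeps contention and fragmentation low.
Objects are organized into fixed-size blocks, and a rotating-shift technique on the bitmaps reduces thread contention during allocation.
A Structure of Arrays (SOA) data layout is enforced so that memory accesses coalesce, which improves memory bandwidth utilization.
An integrated parallel do-all operation executes a method on all live objects concurrently, which enables efficient execution of SMMO workloads.
Object pointers encode the block size and offset, which enables an efficient memory layout and supports class inheritance without wasting memory.
The design gives up some raw allocation speed in exchange for optimized data access and reduced fragmentation, improving overall application performance.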
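
Two of the ideas in this list, SOA storage inside a fixed-size block and lock-free slot allocation on a bitmap with a rotated search start, can be sketched on the host in C++. This is a minimal illustration under stated assumptions, not DynaSOAr's implementation: the names `ParticleBlock`, `allocate_slot`, and `free_slot`, the two example fields, and the 64-slot block capacity are all hypothetical, and `std::atomic` stands in for CUDA atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical block capacity: one 64-bit word tracks 64 object slots.
constexpr int kSlots = 64;

// One block of a hypothetical Particle type. Each field is stored as its
// own contiguous array (SOA), so neighbouring slots hold neighbouring
// values of the same field.
struct ParticleBlock {
    std::atomic<std::uint64_t> allocated{0};  // bit = 1 means "slot in use"
    float pos[kSlots];                        // field 0 of every object
    float vel[kSlots];                        // field 1 of every object
};

// Rotate a 64-bit word right by r bits.
inline std::uint64_t rotr64(std::uint64_t x, unsigned r) {
    r &= 63u;
    return r == 0 ? x : (x >> r) | (x << (64 - r));
}

// Index of the lowest set bit; x must be non-zero.
inline int lowest_set_bit(std::uint64_t x) {
    assert(x != 0);
    int i = 0;
    while ((x & 1u) == 0) { x >>= 1; ++i; }
    return i;
}

// Claim a free slot without locks. `seed` (e.g. a thread or warp id,
// an assumption here) rotates the search start so that concurrent
// callers tend to try different bits. Returns -1 if the block is full.
inline int allocate_slot(ParticleBlock& b, unsigned seed) {
    seed &= 63u;
    for (;;) {
        std::uint64_t free_bits = ~b.allocated.load(std::memory_order_relaxed);
        if (free_bits == 0) return -1;
        // After rotating right by `seed`, bit 0 corresponds to slot `seed`.
        unsigned k = static_cast<unsigned>(lowest_set_bit(rotr64(free_bits, seed)));
        int slot = static_cast<int>((k + seed) % kSlots);
        std::uint64_t mask = std::uint64_t{1} << slot;
        std::uint64_t prev = b.allocated.fetch_or(mask, std::memory_order_acq_rel);
        if ((prev & mask) == 0) return slot;  // this caller set the bit
        // Another caller claimed the slot first; reload and retry.
    }
}

// Release a slot by clearing its bit.
inline void free_slot(ParticleBlock& b, int slot) {
    assert(slot >= 0 && slot < kSlots);
    b.allocated.fetch_and(~(std::uint64_t{1} << slot), std::memory_order_acq_rel);
}
```

With a seed of 0 the search starts at slot 0; with a seed of 10 it starts at slot 10 and wraps around. A caller that receives slot `s` would then read and write `pos[s]` and `vel[s]`, so threads handling adjacent slots touch adjacent addresses of the same field.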

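Experimental Results

Research Questions

RQ1: Can a GPU memory allocator be designed to optimize memory access coalescing and data locality, not just raw allocation speed?
RQ2: How can dynamic memory allocation be made efficient and scalable in a lock-free, parallel GPU setting for object-oriented workloads?
RQ3: How much can a Structure of Arrays (SOA) layout improve memory bandwidth utilization and cache efficiency in GPU-accelerated object-oriented applications?
RQ4: Can hierarchical bitmaps manage free memory blocks effectively, with low fragmentation and high scalability, under massive parallelism?
RQ5: How does integrating a parallel do-all operation contribute to the GPU performance of SMMO-style applications?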

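Key Results

DynaSOAr achieves up to a 3x speedup in application-level performance over state-of-the-art GPU allocators, mainly because the SOA layout improves memory access coalescing.
Across repeated allocation and deallocation cycles, fragmentation stays low and stable at about 18%, indicating substantially reduced memory fragmentation.
With the same heap size, DynaSOAr supports up to 2x larger problem sizes than other allocators because it has no internal fragmentation.
Rotating shifts in the hierarchical bitmaps reduce thread contention and improve allocation performance; the ablation study shows a clear performance drop when this optimization is disabled.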
Object enumeration for the parallel do-all operation has nearly negligible overhead and scales efficiently with heap size, which demonstrates the robustness of the hierarchical bitmap design.
In the Linux Scalability benchmark, DynaSOAr reached 96.9% heap utilization, far above Halloc (49.8%) and slightly below BitmapAlloc (98.4%), while reportedly outperforming both in overall performance.
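
One way to picture why enumeration through a hierarchical bitmap can stay cheap is a two-level bitmap in which the top level records which leaf words are non-empty, so a scan skips empty regions entirely. The sketch below is a sequential host-side C++ illustration under stated assumptions: `TwoLevelBitmap`, `collect`, and the 64 x 64 capacity are hypothetical, concurrency is omitted, and in a GPU do-all the visited indices would be distributed across threads rather than walked in a loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Two-level bitmap over 64 * 64 = 4096 slots (sizes are illustrative).
// Invariant: bit w of `top` is set exactly when leaf[w] is non-zero.
struct TwoLevelBitmap {
    std::uint64_t top = 0;
    std::uint64_t leaf[64] = {};

    void set(int i) {
        assert(i >= 0 && i < 64 * 64);
        leaf[i / 64] |= std::uint64_t{1} << (i % 64);
        top |= std::uint64_t{1} << (i / 64);
    }

    void clear(int i) {
        assert(i >= 0 && i < 64 * 64);
        leaf[i / 64] &= ~(std::uint64_t{1} << (i % 64));
        // Keep the invariant: drop the top-level bit once the leaf is empty.
        if (leaf[i / 64] == 0) top &= ~(std::uint64_t{1} << (i / 64));
    }

    // Index of the lowest set bit; x must be non-zero.
    static int lowest_bit(std::uint64_t x) {
        int i = 0;
        while ((x & 1u) == 0) { x >>= 1; ++i; }
        return i;
    }

    // Visit every set index in ascending order. Only leaf words flagged
    // in `top` are read, so empty regions cost nothing.
    template <typename F>
    void for_each(F f) const {
        for (std::uint64_t t = top; t != 0; t &= t - 1) {      // t &= t - 1 clears the lowest set bit
            int w = lowest_bit(t);
            for (std::uint64_t l = leaf[w]; l != 0; l &= l - 1) {
                f(w * 64 + lowest_bit(l));
            }
        }
    }
};

// Hypothetical helper: gather all set indices into a vector.
inline std::vector<int> collect(const TwoLevelBitmap& bm) {
    std::vector<int> out;
    bm.for_each([&](int i) { out.push_back(i); });
    return out;
}
```

The work done by `for_each` depends on the number of set bits and non-empty leaf words, not on the total capacity, which is the property that lets such a scan scale with heap size.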

This review was generated by AI and checked by a human editor.