QUICK REVIEW

[Paper Review] DynaSOAr: A Parallel Memory Allocator for Object-oriented Programming on GPUs with Efficient Memory Access

Matthias Springer, Hidehiko Masuhara|arXiv (Cornell University)|Oct 28, 2018

Parallel Computing and Optimization Techniques | Cited by 3

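One-Sentence Summary

DynaSOAr is a CUDA-based, lock-free dynamic memory allocator for object-oriented programming on GPUs that optimizes both memory allocation and memory access patterns. By combining a hierarchical-bitmap allocator, a Structure of Arrays (SOA) data layout, and a parallel do-all operation, it speeds up application code by up to 3x over state-of-the-art allocators and reduces memory fragmentation, so that up to 2x larger problem sizes fit in the same memory budget.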

ABSTRACT

Object-oriented programming has long been regarded as too inefficient for SIMD high-performance computing, despite the fact that many important HPC applications have an inherent object structure. On SIMD accelerators, including GPUs, this is mainly due to performance problems with memory allocation and memory access: There are a few libraries that support parallel memory allocation directly on accelerator devices, but all of them suffer from uncoalesced memory accesses. We discovered a broad class of object-oriented programs with many important real-world applications that can be implemented efficiently on massively parallel SIMD accelerators. We call this class Single-Method Multiple-Objects (SMMO), because parallelism is expressed by running a method on all objects of a type. To make fast GPU programming available to average programmers, we developed DynaSOAr, a CUDA framework for SMMO applications. DynaSOAr consists of (1) a fully-parallel, lock-free, dynamic memory allocator, (2) a data layout DSL and (3) an efficient, parallel do-all operation. DynaSOAr achieves performance superior to state-of-the-art GPU memory allocators by controlling both memory allocation and memory access. DynaSOAr improves the usage of allocated memory with a Structure of Arrays data layout and achieves low memory fragmentation through efficient management of free and allocated memory blocks with lock-free, hierarchical bitmaps. Contrary to other allocators, our design is heavily based on atomic operations, trading raw (de)allocation performance for better overall application performance. In our benchmarks, DynaSOAr achieves a speedup of application code of up to 3x over state-of-the-art allocators. Moreover, DynaSOAr manages heap memory more efficiently than other allocators, allowing programmers to run up to 2x larger problem sizes with the same amount of memory.

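Motivation and Objectives

Address the performance bottleneck of dynamic memory allocation in object-oriented GPU programming, particularly for data-parallel workloads.
Provide efficient, scalable, lock-free memory management on SIMT architectures such as GPUs for applications with dynamic sets of objects.
Improve memory access efficiency by using a Structure of Arrays (SOA) layout that enables coalesced memory accesses.
Support the Single-Method Multiple-Objects (SMMO) programming model, in which one method is applied in parallel to all instances of a class.
Reduce memory fragmentation and improve heap utilization so that larger problem sizes fit within a fixed memory limit.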
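
The coalescing argument behind the SOA layout, together with the SMMO idea of running one method over all objects, can be illustrated with a small host-side C++ sketch. The particle type, its fields, `kCapacity`, and `move_all` are hypothetical names chosen for illustration and are not DynaSOAr's API; on a GPU each loop iteration would correspond to one thread.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed number of objects per block (illustrative value only).
constexpr std::size_t kCapacity = 64;

// Array of Structures: the fields of one object sit next to each other.
struct ParticleAoS {
    float pos;
    float vel;
};

// Structure of Arrays: the same field of all objects sits next to each
// other, so threads i, i+1, ... reading pos[i], pos[i+1], ... touch
// consecutive addresses.
struct ParticleBlockSoA {
    std::array<float, kCapacity> pos{};
    std::array<float, kCapacity> vel{};
};

// SMMO-style "do-all": apply the same method body to every object.
// Written as a plain loop here; a GPU would map one thread to each i.
inline void move_all(ParticleBlockSoA& block, std::size_t count, float dt) {
    assert(count <= kCapacity);
    for (std::size_t i = 0; i < count; ++i) {
        block.pos[i] += block.vel[i] * dt;
    }
}

// Byte distance between the `pos` fields of two neighbouring objects.
constexpr std::size_t stride_aos() { return sizeof(ParticleAoS); }
constexpr std::size_t stride_soa() { return sizeof(float); }
```

In the SOA block, the `pos` values of neighbouring objects are `sizeof(float)` bytes apart instead of `sizeof(ParticleAoS)`, which is the property that lets adjacent threads combine their loads into fewer memory transactions.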

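Proposed Method

DynaSOAr uses hierarchical bitmaps, updated with atomic operations and without locks, to track free and allocated memory blocks while keeping contention and fragmentation low.
It organizes objects in fixed-size blocks and applies a rotating shift to bitmap words to reduce thread contention during allocation.
The allocator enforces a Structure of Arrays (SOA) data layout so that memory accesses coalesce and memory bandwidth is used more effectively.
It integrates a parallel do-all operation that runs a method on all active objects in a single, synchronized kernel launch, which efficiently supports SMMO workloads.
Object pointers encode block size and offset, which gives a compact memory layout and supports class inheritance without wasting memory.
The design trades raw (de)allocation speed for better overall application performance by optimizing data access and reducing fragmentation.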
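
The hierarchical-bitmap idea described above can be sketched on the host with `std::atomic`. This is a simplified two-level illustration under stated assumptions, not the paper's implementation: the class name `TwoLevelBitmap`, the method names, the fixed size of 4096 slots, and the choice to let a set bit mean "free" are all hypothetical, and the real framework runs on the device with CUDA atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Index of the lowest set bit; the caller must pass a non-zero word.
inline int lowest_set_bit(std::uint64_t x) {
    assert(x != 0);
    int i = 0;
    while ((x & 1u) == 0) { x >>= 1; ++i; }
    return i;
}

// Two-level bitmap over 64 * 64 = 4096 slots. A set leaf bit means the
// slot is free; summary bit w is set while leaf word w is non-zero.
class TwoLevelBitmap {
public:
    // Mark slot i as free. Returns false if it was already free.
    bool release(int i) {
        const int w = i / 64;
        const std::uint64_t mask = std::uint64_t{1} << (i % 64);
        const std::uint64_t before = leaf_[w].fetch_or(mask);
        if (before & mask) return false;
        if (before == 0) flip_summary(w, true);  // word became non-empty
        return true;
    }

    // Claim some free slot and return its index, or -1 if none is visible.
    int allocate() {
        for (;;) {
            const std::uint64_t s = summary_.load();
            if (s == 0) return -1;
            const int w = lowest_set_bit(s);
            const std::uint64_t word = leaf_[w].load();
            if (word == 0) continue;  // summary momentarily stale; retry
            const int b = lowest_set_bit(word);
            const std::uint64_t mask = std::uint64_t{1} << b;
            const std::uint64_t before = leaf_[w].fetch_and(~mask);
            if (!(before & mask)) continue;  // another thread took this bit
            if (before == mask) flip_summary(w, false);  // word became empty
            return w * 64 + b;
        }
    }

private:
    // Only the thread that moved a leaf word between zero and non-zero
    // flips the summary bit, retrying until the bit was in the old state.
    void flip_summary(int w, bool to_set) {
        const std::uint64_t mask = std::uint64_t{1} << w;
        for (;;) {
            const std::uint64_t before =
                to_set ? summary_.fetch_or(mask) : summary_.fetch_and(~mask);
            if (static_cast<bool>(before & mask) != to_set) return;
        }
    }

    std::atomic<std::uint64_t> summary_{0};
    std::atomic<std::uint64_t> leaf_[64]{};
};
```

The summary level lets a thread skip 64 leaf bits per summary bit when searching, and every state change is a single atomic read-modify-write, so no thread ever holds a lock on the structure.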

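Experimental Results

Research Questions

RQ1: Can a GPU memory allocator be designed to optimize not only raw allocation speed but also memory access coalescing and data locality?
RQ2: How can dynamic memory allocation for object-oriented workloads be made efficient and scalable in a lock-free, parallel GPU environment?
RQ3: To what extent does a Structure of Arrays (SOA) layout improve memory bandwidth utilization and cache efficiency in GPU-accelerated object-oriented applications?
RQ4: Can hierarchical bitmaps manage free memory blocks effectively in a massively parallel environment, with low fragmentation and high scalability?
RQ5: To what extent does an integrated parallel do-all operation improve the performance of SMMO-style applications on GPUs?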

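Key Findings

Compared with state-of-the-art GPU allocators, DynaSOAr speeds up application code by up to 3x, mainly because the SOA layout makes memory accesses coalesce.
The allocator reduces memory fragmentation to about 18% and keeps it low and stable even after many allocation and deallocation cycles.
With the same heap size, DynaSOAr supports problem sizes up to 2x larger than other allocators because its design avoids internal fragmentation.
The rotating shift applied to the hierarchical bitmaps substantially reduces thread contention and improves allocation performance; an ablation experiment shows a marked slowdown without this optimization.
Enumerating objects through the parallel do-all operation has negligible overhead and scales well with heap size, which supports the robustness of the hierarchical-bitmap design.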
In the Linux Scalability benchmark, DynaSOAr reaches 96.9% heap memory utilization, far above Halloc (49.8%) and close to BitmapAlloc (98.4%).
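
The rotating-shift idea from the findings above can be sketched as follows. The point is that threads searching the same bitmap word start from different bit positions, so they are less likely to race for the same bit. The function names and the `seed` parameter are hypothetical; how the shift amount is actually derived per thread is not stated in this review, so the sketch simply takes it as an input.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit word right by r positions (r is taken modulo 64).
inline std::uint64_t rotr64(std::uint64_t x, unsigned r) {
    r %= 64;
    return r == 0 ? x : (x >> r) | (x << (64 - r));
}

// Index of the lowest set bit; the caller must pass a non-zero word.
inline int lowest_set_bit(std::uint64_t x) {
    assert(x != 0);
    int i = 0;
    while ((x & 1u) == 0) { x >>= 1; ++i; }
    return i;
}

// Pick a set bit of `word`, starting the search at position seed % 64
// and wrapping around. Returns -1 if the word has no set bit.
inline int pick_bit_rotated(std::uint64_t word, unsigned seed) {
    if (word == 0) return -1;
    const unsigned r = seed % 64;
    // After rotating right by r, original bit p sits at (p - r) mod 64,
    // so the lowest set bit of the rotated word maps back to (pos + r) mod 64.
    const unsigned pos = static_cast<unsigned>(lowest_set_bit(rotr64(word, r)));
    return static_cast<int>((pos + r) % 64);
}
```

With a plain lowest-set-bit search, every thread that reads the same word would try to claim the same bit and all but one would have to retry; with different seeds, threads reading a word with several set bits tend to pick different ones.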

This review was generated by AI and reviewed by human editors.