QUICK REVIEW

[Paper Review] High Performance Implementation of Boris Particle Pusher on DPC++. A First Look at oneAPI

Valentin Volokitin, A. V. Bashinov|arXiv (Cornell University)|Apr 9, 2021

Laser-induced spectroscopy and plasma | References: 25 | Cited by: 5

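One-Sentence Summary

This paper presents a high-performance DPC++ implementation of the Boris particle pusher, ported from the optimized C++ code of the Hi-Chi framework. On Intel Xeon CPUs the DPC++ version is on average only about 10% slower than the original C++ code. On Intel GPUs it compiles and runs without GPU-specific tuning and delivers performance in line with the hardware parameters (for example, 1.00 NSPS, nanoseconds per particle per step, on Iris Xe Max with the SoA layout). The results indicate that oneAPI can support portable, high-performance plasma simulation code across heterogeneous architectures.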

ABSTRACT

New hardware architectures open up immense opportunities for supercomputer simulations. However, programming techniques for different architectures vary significantly, which leads to the necessity of developing and supporting multiple code versions, each being optimized for specific hardware features. The oneAPI framework, recently introduced by Intel, contains a set of programming tools for the development of portable codes that can be compiled and fine-tuned for CPUs, GPUs, FPGAs, and accelerators. In this paper, we report on the experience of porting the implementation of Boris particle pusher to oneAPI. Boris particle pusher is one of the most demanding computational stages of the Particle-in-Cell method, which, in particular, is used for supercomputer simulations of laser-plasma interactions. We show how to adapt the C++ implementation of the particle push algorithm from the Hi-Chi project to the DPC++ programming language and report the performance of the code on high-end Intel CPUs (Xeon Platinum 8260L) and Intel GPUs (P630 and Iris Xe Max). It turned out that our C++ code can be easily ported to DPC++. We found that on CPUs the resulting DPC++ code is only ~10% on average inferior to the optimized C++ code. Moreover, the code is compiled and run on new Intel GPUs without any specific optimizations and shows the expected performance, taking into account the parameters of the hardware.

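Motivation and Objectives

Assess the feasibility and performance of porting a high-performance C++ particle pusher to DPC++ with oneAPI for heterogeneous computing.
Compare the performance of the DPC++ implementation on Intel CPUs and GPUs against the hand-optimized C++ version.
Evaluate how data layout (AoS vs. SoA) and NUMA-aware memory access affect performance in DPC++.
Determine whether DPC++ can deliver portable, high-performance code across CPUs and GPUs without extensive GPU-specific tuning.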

Proposed Method

Port the optimized C++ particle pusher from the Hi-Chi framework to DPC++ using the oneAPI programming model.
Replace the OpenMP-based parallelism with SYCL-based DPC++ constructs, expressing the particle loop as data-parallel kernels.
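Use both AoS (Array of Structures) and SoA (Structure of Arrays) data layouts to evaluate memory access patterns on CPUs and GPUs.
Configure the DPC++ code for NUMA-aware memory access on multi-socket CPUs to improve cache locality and memory bandwidth utilization.
Run single- and double-precision benchmarks on a high-end Intel Xeon Platinum 8260L CPU and on Intel P630 and Iris Xe Max GPUs.
Use the same compute kernel on CPU and GPU targets to evaluate portability and performance portability.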
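
To make the kernel and the two layouts concrete, here is a minimal sketch of one relativistic Boris step over an SoA particle container, written in plain C++. It is not the Hi-Chi or DPC++ code from the paper. The names `Vec3`, `ParticleAoS`, `ParticlesSoA`, and `boris_push` are hypothetical, the fields are assumed uniform (a PIC code would use per-particle interpolated fields), and units are normalized so that c = 1 and the stored momentum is u = p / m.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

inline Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

inline double norm2(const Vec3& a) { return a.x * a.x + a.y * a.y + a.z * a.z; }

// Hypothetical AoS element, shown only for contrast: all fields of one
// particle are adjacent in memory.
struct ParticleAoS { Vec3 r; Vec3 u; };

// Hypothetical SoA container: one contiguous array per component, so
// neighbouring loop iterations read neighbouring memory addresses.
struct ParticlesSoA {
    std::vector<double> x, y, z;     // positions
    std::vector<double> ux, uy, uz;  // normalized momentum u = p / m (c = 1)
};

// One Boris step for every particle. Assumptions: uniform E and B,
// normalized units with c = 1, a single species with charge q and mass m.
void boris_push(ParticlesSoA& p, const Vec3& E, const Vec3& B,
                double q, double m, double dt) {
    const double k = q * dt / (2.0 * m);
    const Vec3 eps{k * E.x, k * E.y, k * E.z};  // half electric impulse
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        // First half electric kick.
        const Vec3 um{p.ux[i] + eps.x, p.uy[i] + eps.y, p.uz[i] + eps.z};
        // Magnetic rotation, which preserves |u|.
        const double gamma = std::sqrt(1.0 + norm2(um));
        const Vec3 t{k * B.x / gamma, k * B.y / gamma, k * B.z / gamma};
        const double f = 2.0 / (1.0 + norm2(t));
        const Vec3 s{f * t.x, f * t.y, f * t.z};
        const Vec3 a = cross(um, t);
        const Vec3 uprime{um.x + a.x, um.y + a.y, um.z + a.z};
        const Vec3 b = cross(uprime, s);
        // Rotated momentum plus the second half electric kick.
        const Vec3 un{um.x + b.x + eps.x, um.y + b.y + eps.y, um.z + b.z + eps.z};
        // Advance the position with the new velocity v = u / gamma.
        const double inv_gamma = 1.0 / std::sqrt(1.0 + norm2(un));
        p.ux[i] = un.x;  p.uy[i] = un.y;  p.uz[i] = un.z;
        p.x[i] += un.x * inv_gamma * dt;
        p.y[i] += un.y * inv_gamma * dt;
        p.z[i] += un.z * inv_gamma * dt;
    }
}
```

Each iteration touches only particle `i`, which is why the loop maps naturally onto an OpenMP parallel loop or a SYCL `parallel_for` kernel with one work-item per particle. With the SoA container, consecutive work-items read consecutive elements of each array; with `ParticleAoS`, they read addresses one whole structure apart.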

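Experimental Results

Research Questions

RQ1: Can a high-performance C++ particle pusher be ported to DPC++ with only minor modifications and an acceptable performance loss?
RQ2: Without any GPU-specific optimization, how does the DPC++ implementation on Intel GPUs compare with the optimized CPU code?
RQ3: How does data layout (AoS vs. SoA) affect CPU and GPU performance in DPC++?
RQ4: To what extent does NUMA-aware memory access improve DPC++ performance on multi-socket CPUs?
RQ5: Can oneAPI deliver portable, high-performance HPC code for plasma simulation across CPUs and GPUs with minimal refactoring?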

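Key Findings

On the Intel Xeon Platinum 8260L CPU, the DPC++ implementation is on average only about 10% slower than the hand-optimized C++ code, so the port costs little performance.
On 48 CPU cores, the DPC++ implementation reaches a strong-scaling efficiency of up to 63%, which indicates that the parallelization and the NUMA-aware memory access are effective.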
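On the Intel Iris Xe Max GPU, the DPC++ implementation reaches 1.00 nanoseconds per particle per step (NSPS) in single precision with the SoA layout, a level the paper describes as expected given the hardware parameters.
On GPUs the data layout matters: on Iris Xe Max, AoS takes 2.10 NSPS versus 1.00 NSPS for SoA, so SoA cuts the time per particle by more than 50%.
With no GPU-specific optimization, the DPC++ implementation runs 3.5–4.5× slower on the P630 and 1.7–2.6× slower on the Iris Xe Max than on the high-end CPU, which leaves substantial room for GPU tuning.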
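The results support the conclusion that oneAPI and DPC++ enable portable, high-performance HPC code across CPUs and GPUs with little initial porting effort, and with encouraging performance on new Intel GPU architectures.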
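
The ratios quoted in these findings can be checked with a short calculation. The sketch below assumes the standard definition of strong-scaling efficiency, $E_p = T_1 / (p\,T_p)$, which this summary does not state explicitly. The symbols $t_{\mathrm{AoS}}$, $t_{\mathrm{SoA}}$, and $S_p$ are introduced here for illustration only.

```latex
% Layout effect on Iris Xe Max (single precision), times in NSPS:
\frac{t_{\mathrm{AoS}}}{t_{\mathrm{SoA}}} = \frac{2.10}{1.00} = 2.1,
\qquad
1 - \frac{t_{\mathrm{SoA}}}{t_{\mathrm{AoS}}} = 1 - \frac{1.00}{2.10} \approx 0.52

% Speedup implied by 63% strong-scaling efficiency on p = 48 cores:
S_{48} = E_{48} \cdot p = 0.63 \times 48 \approx 30.2
```

Under that assumption, SoA is about 2.1 times faster than AoS on this GPU (a 52% reduction in time per particle), and the 48-core run is roughly 30 times faster than a single core.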

This review was generated by AI and reviewed by a human editor.