QUICK REVIEW

[Paper Review] Comparing single-node and multi-node performance of an important fusion HPC code benchmark

E. A. Belli, J. Candy|arXiv (Cornell University)|May 19, 2022

Magnetic confinement fusion research | References: 2 | Cited by: 4

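One-sentence summary

This paper evaluates the performance of CGYRO, a key fusion plasma turbulence simulation code, on single-node and multi-node HPC systems. On a single Google Cloud instance with 16 A100 GPUs, the nl03 benchmark runs faster than on 8 NERSC Perlmutter nodes, 16 ORNL Summit nodes, or 256 NERSC Cori nodes, reaching comparable simulation times with less than half the number of GPUs. The results indicate that, for communication-heavy fusion codes, the high-bandwidth local GPU interconnect of a large single node outperforms multi-node networking.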

ABSTRACT

Fusion simulations have traditionally required the use of leadership scale High Performance Computing (HPC) resources in order to produce advances in physics. The impressive improvements in compute and memory capacity of many-GPU compute nodes are now allowing for some problems that once required a multi-node setup to be also solvable on a single node. When possible, the increased interconnect bandwidth can result in order of magnitude higher science throughput, especially for communication-heavy applications. In this paper we analyze the performance of the fusion simulation tool CGYRO, an Eulerian gyrokinetic turbulence solver designed and optimized for collisional, electromagnetic, multiscale simulation, which is widely used in the fusion research community. Due to the nature of the problem, the application has to work on a large multi-dimensional computational mesh as a whole, requiring frequent exchange of large amounts of data between the compute processes. In particular, we show that the average-scale nl03 benchmark CGYRO simulation can be run at an acceptable speed on a single Google Cloud instance with 16 A100 GPUs, outperforming 8 NERSC Perlmutter Phase1 nodes, 16 ORNL Summit nodes and 256 NERSC Cori nodes. Moving from a multi-node to a single-node GPU setup we get comparable simulation times using less than half the number of GPUs. Larger benchmark problems, however, still require a multi-node HPC setup due to GPU memory capacity needs, since at the time of writing no vendor offers nodes with a sufficient GPU memory setup. The upcoming external NVSWITCH does however promise to deliver an almost equivalent solution for up to 256 NVIDIA GPUs.

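Motivation and objectives

Evaluate the performance trade-offs between single-node and multi-node HPC configurations for the CGYRO fusion simulation code.
Determine whether modern many-GPU nodes can outperform traditional multi-node, leadership-class HPC systems on mainstream fusion simulations.
Assess the impact of interconnect bandwidth on communication-heavy HPC workloads such as CGYRO.
Make the case for adopting large single-node GPU systems in HPC infrastructure for fusion science.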

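Proposed method

Benchmark CGYRO's nl03 test case on a single node (Google Cloud, 16 A100 GPUs) and on multi-node HPC systems (NERSC Perlmutter, ORNL Summit, NERSC Cori).
Compare performance by measuring time to solution and GPU utilization across the configurations.
Analyze the communication patterns, focusing on the MPI_Alltoall and MPI_Allreduce operations within the MPI communicators.
Assess interconnect performance by comparing intra-node NVLink bandwidth with inter-node network bandwidth (for example, 40–50 Gbps).
Use a representative fusion simulation workload with a 6D computational grid that requires frequent data exchange.
Evaluate how GPU memory capacity affects problem scalability, and identify the limits of single-node deployment.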
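
To make the bandwidth comparison in the method concrete, the following is a minimal sketch of a bandwidth-only cost estimate for one all-to-all exchange. It is not taken from the paper or from CGYRO. The function name `alltoall_seconds`, the 1 GiB payload, and the 300 GB/s intra-node figure are illustrative assumptions; only the 50 Gbps value comes from the range quoted above. The model ignores latency, message overlap, and network topology.

```python
def alltoall_seconds(bytes_per_rank: float, ranks: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one all-to-all exchange.

    Assumes each rank keeps 1/ranks of its local data and sends the rest
    over its own link at the given sustained bandwidth (hypothetical model).
    """
    if ranks < 2:
        return 0.0
    sent_bytes = bytes_per_rank * (ranks - 1) / ranks
    return sent_bytes / link_bytes_per_s


GBPS = 1e9 / 8  # bytes per second carried by one gigabit per second

# Illustrative inputs, not measurements from the paper.
payload = 1 * 2**30        # 1 GiB exchanged per rank (assumed)
inter_node = 50 * GBPS     # upper end of the 40-50 Gbps range quoted above
intra_node = 300e9         # assumed NVLink-class bandwidth in bytes/s

t_network = alltoall_seconds(payload, 16, inter_node)
t_nvlink = alltoall_seconds(payload, 16, intra_node)
ratio = t_network / t_nvlink

print(f"inter-node estimate: {t_network:.4f} s")
print(f"intra-node estimate: {t_nvlink:.4f} s")
print(f"estimated slowdown over the network: {ratio:.1f}x")
```

In this simplified model the slowdown equals the ratio of the two bandwidths, so the result depends entirely on the assumed figures. Real behavior also depends on message sizes and on how the ranks are mapped to links.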

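Experimental results

Research questions

RQ1: For communication-heavy fusion simulations such as CGYRO, can a single large GPU node outperform multi-node HPC systems?
RQ2: For this workload, how large is the performance gap between the intra-node GPU interconnect and the inter-node network interconnect of multi-node HPC systems?
RQ3: How do the number of GPUs and their memory capacity affect the feasibility of running CGYRO benchmarks on a single node?
RQ4: To what extent do modern GPU accelerators allow problems that once required multi-node HPC to be solved on a single node?
RQ5: What does this imply for HPC resource procurement strategy in fusion energy research?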

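Key findings

The nl03 benchmark completes faster on a single Google Cloud node with 16 A100 GPUs than on 8 NERSC Perlmutter nodes, 16 ORNL Summit nodes, or 256 NERSC Cori nodes.
The single-node configuration achieves time to solution comparable to the multi-node configurations while using less than half the number of GPUs.
The performance advantage comes from the much higher interconnect bandwidth inside a single node (NVLink) compared with the inter-node network (for example, 40–50 Gbps).
Despite higher per-GPU compute throughput, multi-node HPC systems achieve only limited speedup because of network bottlenecks.
Because no vendor currently offers a node with sufficient GPU memory, larger CGYRO simulations still depend on multi-node systems.
NVIDIA's upcoming external NVSwitch is expected to extend the high-bandwidth interconnect to up to 256 GPUs, offering a scalable alternative for larger simulations.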
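
The memory constraint in the findings above can be illustrated with a short feasibility check. This is a sketch under stated assumptions, not the paper's analysis. The helper names, the 40 GiB per-GPU capacity, and both problem footprints are hypothetical; only the 16-GPU and 256-GPU counts appear in the text. The check assumes memory is spread evenly across GPUs and ignores per-GPU overhead.

```python
import math


def min_gpus_for_memory(required_gib: float, gib_per_gpu: float) -> int:
    """Smallest GPU count whose combined memory holds the problem (even split assumed)."""
    return math.ceil(required_gib / gib_per_gpu)


def fits_on_node(required_gib: float, gpus_per_node: int, gib_per_gpu: float) -> bool:
    """True if the problem fits in the aggregate GPU memory of one node."""
    return min_gpus_for_memory(required_gib, gib_per_gpu) <= gpus_per_node


GIB_PER_GPU = 40.0        # assumed per-GPU capacity, not stated in the source
SINGLE_NODE_GPUS = 16     # single-node GPU count from the text
NVSWITCH_MAX_GPUS = 256   # upper limit quoted for the external NVSwitch

# Hypothetical footprints chosen only to show both outcomes.
medium_problem_gib = 500.0
large_problem_gib = 2000.0

for name, size in [("medium", medium_problem_gib), ("large", large_problem_gib)]:
    needed = min_gpus_for_memory(size, GIB_PER_GPU)
    print(name, "needs", needed, "GPUs;",
          "fits on one node:", fits_on_node(size, SINGLE_NODE_GPUS, GIB_PER_GPU))
```

With these made-up numbers the medium case fits on one 16-GPU node while the large case does not, although it would still fall under a 256-GPU limit.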

This review was generated by AI and reviewed by a human editor.