QUICK REVIEW

[Paper Review] Accelerating the computation of FLAPW methods on heterogeneous architectures

Davor Davidović, Diego Fabregat‐Traver|arXiv (Cornell University)|Dec 19, 2017

Topic: Parallel Computing and Optimization Techniques | References: 20 | Cited by: 2

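Summary

The paper redesigns the computation of the Hamiltonian and overlap matrices in the FLEUR software using BLAS-3 kernels and hybrid CPU-GPU/Phi offloading, significantly accelerating FLAPW electronic structure calculations. By implementing dynamic and static hybrid BLAS routines, the approach attains over 70% of peak performance on heterogeneous systems and a 5x speedup over CPU-only execution on a JURECA supercomputer node, making full use of both CPU and accelerator resources.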

ABSTRACT

Legacy codes in computational science and engineering have been very successful in providing essential functionality to researchers. However, they are not capable of exploiting the massive parallelism provided by emerging heterogeneous architectures. The lack of portable performance and scalability puts them at high risk: either they evolve or they are doomed to disappear. One example of legacy code which would heavily benefit from a modern design is FLEUR, a software for electronic structure calculations. In previous work, the computational bottleneck of FLEUR was partially re-engineered to have a modular design that relies on standard building blocks, namely BLAS and LAPACK. In this paper, we demonstrate how the initial redesign enables the portability to heterogeneous architectures. More specifically, we study different approaches to port the code to architectures consisting of multi-core CPUs equipped with one or more coprocessors such as Nvidia GPUs and Intel Xeon Phis. Our final code attains over 70\% of the architectures' peak performance, and outperforms Nvidia's and Intel's libraries. Finally, on JURECA, the supercomputer where FLEUR is often executed, the code takes advantage of the full power of the computing nodes, attaining $5\times$ speedup over the sole use of the CPUs.

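Motivation and Objectives

Modernize the legacy FLEUR code so that it can exploit massively parallel heterogeneous architectures.
Address the limited performance portability and scalability of legacy scientific codes such as FLEUR.
Overcome the underutilization of GPUs and Intel Xeon Phis that occurs when vendor-optimized BLAS libraries target only a single device type.
Develop efficient hybrid BLAS implementations that make full use of both CPUs and accelerators in high-performance electronic structure calculations.
Fully utilize modern supercomputer nodes such as those of JURECA by combining CPU and accelerator compute resources.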

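Proposed Method

Reimplement the HSDLA algorithm with standardized BLAS-3 operations (such as zherk and zgemm) to improve modularity and portability.
Design two hybrid BLAS strategies: a dynamic approach based on a task queue and buffers, and a static approach based on a fixed matrix partitioning ratio.
Optimize memory access and data movement by reducing the memory footprint and reorganizing the computational flow of the HSDLA algorithm.
Port the optimized code to heterogeneous systems that combine multi-core CPUs with accelerators (NVIDIA GPUs and Intel Xeon Phis).
Use highly optimized libraries (cuBLASXT, MKL) together with custom hybrid kernels to achieve high performance on the JURECA supercomputer.
Evaluate performance across several test cases (TiO2, AuAg) and Kmax values, measured in GFLOPS and as speedup over the baseline and over the vendor libraries.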
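
The difference between the static and the dynamic hybrid strategy can be illustrated with a small sketch. This is not the paper's implementation: it assumes a row-wise partition of the output matrix, uses two Python threads as stand-ins for the CPU and the accelerator, and replaces the BLAS call with a pure-Python version of the zherk update C = A * A^H (alpha = 1, beta = 0). All function names (herk_rows, run_workers, static_hybrid, dynamic_hybrid) are hypothetical.

```python
import queue
import threading

def herk_rows(A, rows):
    """Rows `rows` of C = A * A^H (the zherk update with alpha=1, beta=0)."""
    return {i: [sum(a * b.conjugate() for a, b in zip(A[i], A[j]))
                for j in range(len(A))]
            for i in rows}

def run_workers(targets):
    """Start one thread per callable and wait for all of them."""
    threads = [threading.Thread(target=t) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def static_hybrid(A, accel_ratio):
    """Static strategy (assumed row split): a fixed share of rows per device."""
    n = len(A)
    split = int(n * accel_ratio)  # rows [0, split) go to the "accelerator"
    C = [None] * n
    def device(rows):
        def work():
            for i, row in herk_rows(A, rows).items():
                C[i] = row
        return work
    run_workers([device(range(0, split)), device(range(split, n))])
    return C

def dynamic_hybrid(A, block, n_workers=2):
    """Dynamic strategy: devices pull row blocks from a shared task queue."""
    n = len(A)
    tasks = queue.Queue()
    for start in range(0, n, block):
        tasks.put(range(start, min(start + block, n)))
    C = [None] * n
    def work():
        while True:
            try:
                rows = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this device
            for i, row in herk_rows(A, rows).items():
                C[i] = row
    run_workers([work] * n_workers)
    return C

# Small 3x2 complex example; both strategies must reproduce the reference.
A = [[1 + 2j, 3 - 1j], [0 + 1j, 2 + 0j], [4 - 3j, 1 + 1j]]
reference = [herk_rows(A, [i])[i] for i in range(len(A))]
print(static_hybrid(A, 0.67) == reference)
print(dynamic_hybrid(A, block=1) == reference)
```

In the static variant the only tuning knob is accel_ratio, which has to match the relative speed of the two devices to avoid idle time. In the dynamic variant the load balances itself, with the block size controlling how much scheduling overhead is paid.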

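Experimental Results

Research Questions

RQ1: To what extent does a BLAS-based redesign give legacy scientific codes such as FLEUR performance portability across heterogeneous architectures?
RQ2: What are the performance bottlenecks when BLAS operations are offloaded to accelerators through vendor-optimized libraries that do not support hybrid CPU-accelerator execution?
RQ3: How do the dynamic and static hybrid BLAS implementations compare in performance, portability, and tunability on GPU and Xeon Phi architectures?
RQ4: Can custom hybrid kernels outperform vendor-optimized libraries (such as cuBLASXT and MKL) when the CPU and accelerators are used together?
RQ5: What performance and scalability can be achieved on a modern supercomputer node such as JURECA when the CPUs and multiple GPUs are fully used at the same time?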

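Key Findings

Algorithmic optimizations reduce the memory footprint and computational cost of the HSDLA implementation, which leads to better performance on heterogeneous systems.
On 4 GPUs, the dynamic hybrid BLAS approach outperforms cuBLASXT by up to 1.19x for the zherk kernel, and by 1.13x for the zgemmt kernel in the TiO2 test case.
The static hybrid BLAS approach achieves a 1.19x speedup for zherk and a 1.13x speedup for zgemmt, and is easier to tune through its matrix partitioning ratio.
On JURECA, the final implementation attains over 70% of the system's peak performance and a 5x speedup over CPU-only execution.
The hybrid implementation outperforms NVIDIA's cuBLASXT and Intel's MKL by using the CPU and accelerators efficiently in parallel.
The code scales well across multiple GPUs: in the AuAg test case, performance rises from 1.5 TFLOPS on 1 GPU to 6 TFLOPS on 4 GPUs.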
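
The multi-GPU scaling figures for AuAg can be turned into a speedup and a parallel efficiency. The sketch below takes the rounded values quoted above (1.5 TFLOPS on 1 GPU, 6 TFLOPS on 4 GPUs) at face value; the helper name scaling is hypothetical.

```python
def scaling(perf_1, perf_n, n_devices):
    """Return (speedup, parallel efficiency) from two throughput figures."""
    speedup = perf_n / perf_1
    return speedup, speedup / n_devices

# Rounded AuAg figures from the findings: 1 GPU vs 4 GPUs, in TFLOPS.
speedup, efficiency = scaling(1.5, 6.0, 4)
print(speedup, efficiency)
```

With these rounded inputs the speedup equals the number of GPUs, which corresponds to ideal parallel efficiency. The unrounded measurements in the paper may differ slightly.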


This review was generated by AI and checked by a human editor.