QUICK REVIEW

[Paper Review] ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems

Nika Mansouri Ghiasi, Nandita Vijaykumar|arXiv (Cornell University)|Dec 13, 2022

Parallel Computing and Optimization Techniques | Cited by 1

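One-Sentence Summary

ALP is a programmer-transparent hardware/software co-design that combines compiler analysis with runtime hardware support to proactively transfer data between the host CPU and near-data processing (NDP) units before the data is needed, alleviating the inter-segment data movement overhead between them. This enables efficient application partitioning: across a wide range of workloads, ALP achieves an average speedup of 54.3% over host-only execution and 45.4% over NDP-only execution.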
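To make the two headline numbers concrete, the short sketch below converts a percentage speedup into the fraction of baseline runtime that remains. It assumes the conventional reading, which this summary does not state explicitly, that an X% speedup means the baseline runtime divided by the ALP runtime equals 1 + X/100. The function name `runtime_fraction` is purely illustrative.

```python
def runtime_fraction(speedup_percent: float) -> float:
    """Fraction of the baseline runtime that remains after the speedup.

    Assumption (not stated in the summary): "X% speedup" means
    T_baseline / T_new = 1 + X / 100.
    """
    return 1.0 / (1.0 + speedup_percent / 100.0)


# ALP runtime relative to each baseline, using the reported averages.
vs_host_only = runtime_fraction(54.3)
vs_ndp_only = runtime_fraction(45.4)

print(f"vs host-only: {vs_host_only:.3f} of baseline runtime")
print(f"vs NDP-only:  {vs_ndp_only:.3f} of baseline runtime")
```

Under that assumption, ALP would take roughly 65% of the host-only runtime and roughly 69% of the NDP-only runtime on average.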

ABSTRACT

Partitioning applications between NDP and host CPU cores causes inter-segment data movement overhead, which is caused by moving data generated from one segment (e.g., instructions, functions) and used in consecutive segments. Prior works take two approaches to this problem. The first class of works maps segments to NDP or host cores based on the properties of each segment, neglecting the inter-segment data movement overhead. The second class of works partitions applications based on the overall memory bandwidth saving of each segment, and does not offload each segment to the best-fitting core if they incur high inter-segment data movement. We show that 1) mapping each segment to its best-fitting core ideally can provide substantial benefits, and 2) the inter-segment data movement reduces this benefit significantly. To this end, we introduce ALP, a new programmer-transparent technique to leverage the performance benefits of NDP by alleviating the inter-segment data movement overhead between host and memory and enabling efficient partitioning of applications. ALP alleviates the inter-segment data movement overhead by proactively and accurately transferring the required data between the segments. This is based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program on different inputs. ALP uses a compiler pass to identify these instructions and uses specialized hardware to transfer data between the host and NDP cores at runtime. ALP efficiently maps application segments to either host or NDP considering 1) the properties of each segment, 2) the inter-segment data movement overhead, and 3) whether this overhead can be alleviated in a timely manner. We evaluate ALP across a wide range of workloads and show on average 54.3% and 45.4% speedup compared to only-host CPU or only-NDP executions, respectively.

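Motivation and Goals

Address the performance degradation caused by inter-segment data movement between the host CPU and NDP units in memory-centric systems.
Identify and mitigate the negative impact of data movement on the performance benefit of offloading memory-intensive segments to NDP units.
Partition applications efficiently between host and NDP cores by jointly considering how well each segment fits each core and the inter-segment data movement overhead.
Develop a transparent, scalable solution that applies to different NDP architectures and requires no programmer intervention.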

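Proposed Method

Use a compiler pass to statically identify the instructions that generate inter-segment data, and thereby predict which data transfers will be needed.
Use specialized hardware at runtime to proactively transfer data between host and NDP cores, guided by the data dependencies the compiler identified.
Combine static (compiler) and dynamic (runtime) information to make partitioning decisions that balance computational benefit against data movement cost.
Map each basic block to its best-fitting execution unit, taking into account segment-level properties (such as memory access patterns) and the estimated data movement overhead.
Exploit the observation that the instructions generating inter-segment data stay the same across executions of a program on different inputs, which makes the prediction reliable.
Abstract away hardware specifics at the logical level so that the technique can be deployed across a variety of NDP architectures.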
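The first step above, finding the instructions that generate inter-segment data, can be illustrated with a toy sketch. This is not ALP's compiler pass: the `Instr` record, the `inter_segment_producers` function, and the three-segment program are all hypothetical, and the sketch assumes each value is defined exactly once (an SSA-like form).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    label: str                      # hypothetical instruction label
    defines: Optional[str] = None   # value written by this instruction, if any
    uses: Tuple[str, ...] = ()      # values read by this instruction


def inter_segment_producers(segments: Dict[str, List[Instr]]) -> Set[str]:
    """Return labels of instructions whose result is read in another segment."""
    # Assumption: every value has a single defining instruction.
    producer = {}
    for seg, instrs in segments.items():
        for ins in instrs:
            if ins.defines is not None:
                producer[ins.defines] = (seg, ins.label)

    marked = set()
    for seg, instrs in segments.items():
        for ins in instrs:
            for value in ins.uses:
                if value not in producer:
                    continue  # program input, not produced by any segment
                src_seg, src_label = producer[value]
                if src_seg != seg:
                    marked.add(src_label)  # value crosses a segment boundary
    return marked


# Hypothetical program with three segments.
program = {
    "seg_A": [Instr("a1", defines="x"), Instr("a2", defines="t", uses=("x",))],
    "seg_B": [Instr("b1", defines="y", uses=("x",)), Instr("b2", defines="z", uses=("y",))],
    "seg_C": [Instr("c1", uses=("z", "t"))],
}
producers = inter_segment_producers(program)
print(sorted(producers))
```

Here `b1` is not marked because its result `y` is consumed only inside `seg_B`. Because the marked set depends on the program's structure rather than on its input values, it matches the observation that these instructions stay the same across executions on different inputs.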

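Experimental Results

Research Questions

RQ1: How does inter-segment data movement between the host and NDP units affect the performance of application partitioning in memory-centric systems?
RQ2: To what extent does inter-segment data movement reduce the performance benefit of offloading memory-intensive segments to NDP units?
RQ3: Can proactive data transfer between host and NDP cores alleviate the performance overhead of inter-segment data movement and unlock the full potential of NDP?
RQ4: Is it possible to partition applications between host and NDP units efficiently, transparently, and scalably without involving the programmer?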

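Key Findings

Compared with running the entire application on the host CPU, ALP achieves an average speedup of 54.3%.
Compared with running the entire application on NDP cores, ALP achieves an average speedup of 45.4%, indicating that NDP resources are used effectively.
Without ALP, inter-segment data movement can reduce the performance benefit by up to 56.3% and can even cause a slowdown relative to host-only execution.
A naive partitioning strategy that considers only segment characteristics loses up to 9.5% performance on average because its data movement overhead goes unmitigated.
By managing data movement proactively, ALP achieves near-ideal partitioning, allowing each segment to be mapped to its best-fitting execution unit.
The technique is effective across a wide range of workloads, suggesting broad applicability in memory-centric system design.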
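The effect described in these findings can be reproduced with a toy cost model. The sketch below is not ALP's partitioning algorithm, and every number in it is invented for illustration. It assumes a linear chain of four segments and a fixed transfer cost whenever two consecutive segments run on different cores, and it uses exhaustive search only because the example is tiny.

```python
from itertools import product


def total_time(mapping, host_t, ndp_t, transfer):
    """Runtime of a chain of segments under a given host/NDP mapping.

    transfer[i] is paid when segments i and i+1 run on different cores.
    """
    compute = sum(host_t[i] if core == "host" else ndp_t[i]
                  for i, core in enumerate(mapping))
    movement = sum(transfer[i] for i in range(len(mapping) - 1)
                   if mapping[i] != mapping[i + 1])
    return compute + movement


# Hypothetical per-segment runtimes and transfer costs (arbitrary time units).
host_t = [4, 10, 3, 9]
ndp_t = [8, 5, 7, 4]
transfer = [6, 6, 6]
no_transfer = [0, 0, 0]

# Naive strategy: give each segment its best-fitting core, ignoring movement.
best_fit = tuple("host" if h <= n else "ndp" for h, n in zip(host_t, ndp_t))

ideal_time = total_time(best_fit, host_t, ndp_t, no_transfer)   # movement is free
naive_time = total_time(best_fit, host_t, ndp_t, transfer)      # movement is paid
host_only_time = total_time(("host",) * 4, host_t, ndp_t, transfer)
ndp_only_time = total_time(("ndp",) * 4, host_t, ndp_t, transfer)

# Movement-aware strategy: search all mappings with the transfer cost included.
aware = min(product(("host", "ndp"), repeat=4),
            key=lambda m: total_time(m, host_t, ndp_t, transfer))
aware_time = total_time(aware, host_t, ndp_t, transfer)

print(ideal_time, naive_time, host_only_time, ndp_only_time, aware_time)
```

With these made-up numbers the ideal mapping takes 16 units, but paying for every transfer pushes the same mapping to 34, which is slower than host-only execution (26). A movement-aware search falls back to NDP-only execution (24). Shrinking the transfer cost, which is what proactive transfers aim to do, moves the achievable time back toward the ideal.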

This review was generated by AI and reviewed by a human editor.