QUICK REVIEW

[Paper Review] GenDRAM: Hardware-Software Co-Design of General Platform in DRAM

Tsung-Han Lu, Weihong Xu | arXiv (Cornell University) | Feb 27, 2026

Parallel Computing and Optimization Techniques | Cited by 0

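One-Sentence Summary

GenDRAM is a monolithic 3D DRAM processing-in-memory (PIM) accelerator that integrates APSP and genomic sequence alignment workloads on a single heterogeneous chip. By co-designing data mapping, processing units (PUs), and execution modes, it reports speedups of over 68x on APSP and over 22x on the end-to-end genomics pipeline relative to state-of-the-art GPU systems.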

ABSTRACT

Dynamic programming (DP) algorithms, such as All-Pairs Shortest Path (APSP) and genomic sequence alignment, are fundamental to many scientific domains but are severely bottlenecked by data movement on conventional architectures. While Processing-in-Memory (PIM) offers a promising solution, existing accelerators often address only a fraction of the workflow, creating new system-level bottlenecks in host-accelerator communication and off-chip data streaming. In this work, we propose GenDRAM, a massively parallel PIM accelerator that overcomes these limitations. GenDRAM leverages the immense capacity and internal bandwidth of monolithic 3D DRAM (M3D DRAM) to integrate entire data-intensive pipelines, such as the full genomics workflow from seeding to alignment, onto a single heterogeneous chip. At its core is a novel architecture featuring specialized Search PUs for memory-intensive tasks and universal, multiplier-less Compute PUs for diverse DP calculations. This is enabled by a 3D-aware data mapping strategy that exploits the tiered latency of M3D DRAM for performance optimization. Through comprehensive simulation, we demonstrate that GenDRAM achieves a transformative performance leap, outperforming state-of-the-art GPU systems by over 68x on APSP and over 22x on the end-to-end genomics pipeline.

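Research Motivation and Goals

Motivate the elimination of data-movement bottlenecks in DP-based workloads such as APSP and genomic sequence alignment.
Propose a PIM architecture built on monolithic 3D DRAM (M3D DRAM) that unifies diverse DP workloads on a single chip.
Develop a heterogeneous PU design (Search PUs and Compute PUs) and a 3D-aware data mapping strategy that exploits tiered DRAM latency.
Execute the end-to-end genomics pipeline (seeding through alignment) on-chip, removing the host-accelerator bottleneck.
Demonstrate performance and energy advantages over state-of-the-art GPUs and domain-specific accelerators.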

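Proposed Method

Propose the GenDRAM architecture, in which a logic die hosts 32 PUs tightly coupled to the DRAM stack.
Provide two PU types: 8 Search PUs for the seeding stage and 24 Compute PUs for DP computation.
Implement APSP and sequence alignment inside the Compute PUs using a Max/Min Engine and its dedicated sub-units.
Apply 3D-aware data mapping: place latency-sensitive data in the fast DRAM tiers, and interleave data across banks and channels to optimize bandwidth.
Adopt a unified dynamic programming abstraction that treats DP as a generalized grid update over a semiring (min-plus for Floyd-Warshall (FW)/APSP, max-plus for alignment).
Support two execution modes: a homogeneous APSP broadcast mode (APSP Broadcast) and a heterogeneous genomics pipeline mode (seeding + alignment) with pipelined scheduling.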
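
The semiring abstraction listed above can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: the `Semiring` class, the function names, and the scoring parameters (`match`, `mismatch`, `gap`) are hypothetical, and a Needleman-Wunsch-style global alignment with linear gap penalties is assumed purely for brevity. It shows how one "select, then add" update can serve both Floyd-Warshall (min-plus) and alignment scoring (max-plus), which is also why no multiplier is needed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Semiring:
    """Hypothetical container for a (select, extend) pair used by a DP update."""
    add: Callable[[float, float], float]  # selection: min or max
    mul: Callable[[float, float], float]  # extension: ordinary addition
    zero: float                           # identity element of `add`


MIN_PLUS = Semiring(add=min, mul=lambda a, b: a + b, zero=float("inf"))
MAX_PLUS = Semiring(add=max, mul=lambda a, b: a + b, zero=float("-inf"))


def floyd_warshall(dist: List[List[float]], sr: Semiring = MIN_PLUS) -> List[List[float]]:
    """All-pairs shortest paths as a min-plus grid update (input is not modified)."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = sr.add(d[i][j], sr.mul(d[i][k], d[k][j]))
    return d


def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                           gap: int = -1, sr: Semiring = MAX_PLUS) -> float:
    """Global alignment score as a max-plus grid update (linear gaps assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[sr.zero] * cols for _ in range(rows)]
    h[0][0] = 0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best = sr.zero
            if i > 0 and j > 0:  # diagonal: match or mismatch
                score = match if a[i - 1] == b[j - 1] else mismatch
                best = sr.add(best, sr.mul(h[i - 1][j - 1], score))
            if i > 0:            # gap in b
                best = sr.add(best, sr.mul(h[i - 1][j], gap))
            if j > 0:            # gap in a
                best = sr.add(best, sr.mul(h[i][j - 1], gap))
            h[i][j] = best
    return h[-1][-1]
```

Both routines differ only in the semiring they are handed and in which neighboring cells feed each update, which is the sense in which a single min/max-and-add datapath could plausibly cover both workloads.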

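Experimental Results

Research Questions

RQ1: Can GenDRAM accelerate both APSP and genomic sequence alignment on a single substrate?
RQ2: How can data placement and runtime scheduling exploit the tiered latency and massive internal bandwidth of M3D DRAM to serve DP workloads?
RQ3: What performance and energy advantages does GenDRAM offer over contemporary GPUs and domain-specific accelerators?
RQ4: What design trade-offs arise when the unified PU design (32 PUs) supports both 32-bit min-plus and 5-bit max-plus DP workloads?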

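Key Findings

Compared with an NVIDIA A100, GenDRAM achieves up to 67× speedup on APSP and 22× speedup on the end-to-end genomics pipeline.
The complex bioinformatics pipeline draws 31.2 W on average, and APSP draws 10.2 W.
GenDRAM improves energy efficiency by 152× over the A100 baseline and by 20× over the RapidGraph accelerator.
The 32-PU configuration saturates the internal bandwidth of M3D DRAM (about 34 TB/s) through a 1:1 PU-to-bank-group mapping.
For DP workloads, latency-aware tiered mapping and bandwidth-aware interleaved mapping are key to fully exploiting M3D DRAM.
The architecture co-designs the genomics workflow through a heterogeneous pipeline spanning the memory-bound seeding stage and the compute-bound alignment stage.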
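
To make the bandwidth and mapping findings above concrete, here is a rough sketch under stated assumptions. Only the 32 PUs, the 1:1 PU-to-bank-group mapping, and the figure of about 34 TB/s come from the summary. The round-robin interleaving rule, the greedy tier-placement heuristic, and all function and region names are hypothetical illustrations, not the paper's actual mapping algorithm.

```python
from typing import Dict, Tuple

NUM_PUS = 32           # from the summary: 32 PUs, mapped 1:1 to bank groups
TOTAL_BW_TBPS = 34.0   # from the summary: about 34 TB/s internal bandwidth


def per_pu_bandwidth_tbps(total: float = TOTAL_BW_TBPS, pus: int = NUM_PUS) -> float:
    """Even share of internal bandwidth per PU under a 1:1 mapping."""
    return total / pus


def interleave(block_id: int, num_bank_groups: int = NUM_PUS) -> Tuple[int, int]:
    """Assumed round-robin placement: returns (bank_group, slot in that group)."""
    return block_id % num_bank_groups, block_id // num_bank_groups


def place_by_tier(regions: Dict[str, Tuple[int, float]],
                  fast_capacity: int) -> Dict[str, str]:
    """Assumed greedy heuristic: regions with the most accesses per unit of size
    go to the fast tier until it is full. `regions` maps a name to
    (size, access_count); sizes must be positive."""
    placement: Dict[str, str] = {}
    remaining = fast_capacity
    by_density = sorted(regions.items(),
                        key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    for name, (size, _accesses) in by_density:
        if size <= remaining:
            placement[name] = "fast"
            remaining -= size
        else:
            placement[name] = "slow"
    return placement
```

Under these assumptions, each PU would see roughly 34 / 32 ≈ 1.06 TB/s, and consecutive blocks land in different bank groups so that all 32 PUs can stream at once. A call such as `place_by_tier({"dp_wavefront": (4, 400.0), "reference": (100, 100.0)}, fast_capacity=12)` uses made-up region names and sizes only to show the shape of the input.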

This review was generated by AI and reviewed by human editors.