QUICK REVIEW

[Paper Review] A Modern Primer on Processing in Memory

Onur Mutlu, Saugata Ghose|arXiv (Cornell University)|Dec 5, 2020

Parallel Computing and Optimization Techniques | Cited by 27

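One-Sentence Summary

This paper presents processing-in-memory (PIM) as a way to address the performance, energy, and scalability bottlenecks that data movement causes in modern computing systems. PIM performs computation in or near memory through two approaches: processing-using-memory (PUM), which exploits the analog operational properties of DRAM, and processing-near-memory (PNM), which exploits 3D-stacked memory. Both deliver significant performance and energy-efficiency gains on data-intensive workloads.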

ABSTRACT

This paper discusses recent research that aims to enable computation close to data, an approach we broadly call processing-in-memory (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside memory chips or modules, in the logic layer of 3D-stacked memory, in the memory controllers, in storage devices or chips), so that data movement between the computation units and memory/storage units is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits and technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) processing-using-memory, which exploits fundamental analog operational principles of memory chips to perform massively-parallel operations in-situ in memory, (2) processing-near-memory, which exploits different logic and memory integration technologies (e.g., 3D-stacked memory technology) to place computation logic close to memory circuitry, and thereby enable high-bandwidth, low-energy, and low-latency access to data. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, compilers, programming models, and applications. Our focus is on the development of PIM designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM. We believe that the shift from a processor-centric to a memory-centric mindset (and infrastructure) remains the largest adoption challenge for PIM, which, once overcome, can unleash a fundamentally energy-efficient, high-performance, and sustainable new way of designing, using, and programming computing systems.

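Research Motivation and Goals

Address the growing performance, energy, and scalability bottlenecks caused by data movement between the processor and main memory in modern computing systems.
Identify the key trends (data-intensive workloads, power constraints, and the scaling limits of memory technology) that push computation toward or into memory.
Propose and evaluate two practical PIM approaches: processing-using-memory (PUM), built on the analog properties of DRAM, and processing-near-memory (PNM), built on the logic layer of 3D-stacked memory.
Overcome cross-layer challenges in devices, architecture, systems, and programming models so that PIM can be adopted in future platforms.
Provide foundational resources (benchmarks, simulation tools, and programming abstractions) to accelerate research on and deployment of PIM systems.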

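Proposed Methods

Propose processing-using-memory (PUM), which exploits the analog operational properties of DRAM cells to perform massively parallel in-memory operations, such as data copy, initialization, and bitwise operations, with almost no hardware changes.
Present PUM techniques such as RowClone, Ambit, SIMDRAM, Gather-Scatter DRAM, and in-memory security primitives, which achieve high performance and energy efficiency.
Propose processing-near-memory (PNM) in the logic layer of 3D-stacked memory, offloading computation from the CPU to reduce data movement and latency.
Realize PNM at several levels of abstraction: application level (e.g., Tesseract for graph processing), function level (e.g., mobile and GPU workloads), and instruction level (e.g., PIM-enabled instructions).
Address system-level challenges, including memory coherence, virtual memory support, runtime scheduling, and data-structure design for PIM-optimized workloads.
Develop and validate simulation infrastructure and benchmarks to estimate the benefits of PIM and to guide future research and hardware prototyping.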
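The bulk bitwise operations mentioned in the PUM items above can be illustrated with a small software model. In the Ambit work, activating three DRAM rows at once leaves every bitline holding the bitwise majority of the three cells, and fixing one of those rows to all zeros or all ones turns that majority into AND or OR. The sketch below only mimics that logic on Python integers; the names `ROW_BITS`, `majority`, `bulk_and`, `bulk_or`, and `row_copy` are hypothetical and chosen for illustration, and nothing here models the analog circuit behavior, timing, or energy.

```python
# A toy model of Ambit-style bulk bitwise operations and RowClone-style copy.
# All names below are hypothetical; a DRAM row is modelled as a Python int.

ROW_BITS = 16                      # assumed row width (real rows are kilobytes wide)
ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows: each output bit is 1 if at least
    two of the three corresponding input bits are 1."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    """AND of two rows, obtained as majority with an all-zeros control row."""
    return majority(a, b, ALL_ZEROS)


def bulk_or(a: int, b: int) -> int:
    """OR of two rows, obtained as majority with an all-ones control row."""
    return majority(a, b, ALL_ONES)


def row_copy(rows: dict, src: str, dst: str) -> None:
    """Whole-row copy inside the (modelled) memory array: the data never
    passes through a separate 'CPU' buffer, which is the idea behind RowClone."""
    rows[dst] = rows[src]


if __name__ == "__main__":
    rows = {"A": 0b1100110011001100, "B": 0b1010101010101010, "T": 0}
    row_copy(rows, "A", "T")
    print(format(bulk_and(rows["T"], rows["B"]), f"0{ROW_BITS}b"))
    print(format(bulk_or(rows["T"], rows["B"]), f"0{ROW_BITS}b"))
```

The point of the model is that one row-wide operation processes every bit of the row at once, so the available parallelism scales with the row width rather than with the width of a processor's datapath.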

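Experimental Results

Research Questions

RQ1: How can processing-using-memory (PUM) exploit the inherent analog properties of DRAM to perform computation with minimal hardware modification while improving performance and energy efficiency?
RQ2: In 3D-stacked memory architectures, to what extent can processing-near-memory (PNM) reduce data movement and improve performance and energy efficiency across diverse workloads?
RQ3: Which system-level challenges, such as memory coherence, virtual memory support, and runtime scheduling, must be solved before PIM can be deployed in real computing platforms?
RQ4: How can programming models and code-generation tools abstract the complexity of PIM hardware so that application developers can use it transparently and efficiently?
RQ5: Which benchmarks and simulation frameworks are needed to evaluate PIM systems accurately and to guide their development at scale?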

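Key Findings

By exploiting the analog behavior of DRAM, PUM techniques such as RowClone and Ambit achieve up to a 100x speedup and a 90% reduction in power consumption on bulk data operations such as memory copy and initialization.
Processing-near-memory (PNM) in the logic layer of 3D-stacked memory achieves up to a 4.5x speedup and a 60% reduction in power consumption for graph processing (Tesseract), and also delivers significant performance gains on genome analysis and time-series workloads.
Function-level PNM acceleration for mobile workloads and GPU applications improves performance by up to 2.5x with very few software changes.
Instruction-level PNM through PIM-enabled instructions (PEI) transparently accelerates existing code without source changes, achieving up to a 2.1x speedup on GPU workloads.
Domain-specific benchmarks and simulation infrastructure make it possible to estimate the benefits and overheads of PIM accurately, which in turn drives future research and hardware design.
Real PIM hardware prototypes and runtime support (such as scheduling and data mapping) are essential for validating PIM and accelerating its adoption in production systems.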

This review was generated by AI and checked by human editors.