QUICK REVIEW

[Paper Review] DeepNVM++: Cross-Layer Modeling and Optimization Framework of Non-Volatile Memories for Deep Learning

Ahmet Inci, Mehmet Meric Isgenc | arXiv (Cornell University) | Dec 8, 2020

Advanced Memory and Neural Computing | References: 58 | Cited by: 17

One-Sentence Summary

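DeepNVM++ is a cross-layer framework for modeling and optimizing STT-MRAM- and SOT-MRAM-based GPU last-level caches for deep learning workloads. By combining circuit-level NVM characterization with memory profiling on real GPUs, it shows that, compared with SRAM, these NVMs achieve up to 4.7× EDP reduction at iso-capacity, up to 3.3× cache capacity at iso-area, and orders-of-magnitude EDP reduction for large cache capacities.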

ABSTRACT

Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.

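Motivation and Objectives

Address the scalability limits of SRAM in GPU last-level caches (LLCs) under deep learning workloads.
Evaluate the power, performance, and area (PPA) trade-offs of emerging NVMs such as STT-MRAM and SOT-MRAM in GPU architectures.
Enable design-space exploration of NVM-based caches for deep learning workloads through a unified modeling framework.
Quantify the benefits of NVM over SRAM in both iso-capacity and iso-area scenarios across a range of deep learning workloads.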

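Proposed Approach

Combine technology-specific circuit-level models of STT-MRAM and SOT-MRAM with the memory access patterns of real GPU workloads.
Profile the memory behavior of deep learning workloads (training and inference) extensively on a real GPU platform to drive the iso-capacity analysis.
Use architecture-level simulation to estimate cache access counts at different cache capacities, enabling the iso-area analysis.
Automatically integrate the memory statistics with microarchitecture- and circuit-level analysis to evaluate PPA metrics.
Use energy-delay product (EDP), area, and latency as the key metrics for comparing cache configurations.
Support scalability analysis by comparing NVM and SRAM across a wide range of cache capacities.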
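
The approach above amounts to multiplying per-access costs from circuit-level models by access counts from profiling and then forming EDP. A minimal Python sketch of that bookkeeping, assuming a simplified model in which LLC accesses are fully serialized and leakage accrues over the resulting delay; `CacheTech`, `cache_edp`, and every number below are hypothetical placeholders, not the paper's models or measured data:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheTech:
    """Per-access costs of one LLC design point (placeholder values, not from the paper)."""
    name: str
    read_energy_nj: float    # dynamic energy per read access (nJ)
    write_energy_nj: float   # dynamic energy per write access (nJ)
    read_latency_ns: float   # latency per read access (ns)
    write_latency_ns: float  # latency per write access (ns)
    leakage_mw: float        # static power of the whole cache (mW)


def cache_edp(tech: CacheTech, reads: int, writes: int) -> tuple[float, float, float]:
    """Return (energy_nj, delay_ns, edp), assuming accesses are fully serialized."""
    delay_ns = reads * tech.read_latency_ns + writes * tech.write_latency_ns
    dynamic_nj = reads * tech.read_energy_nj + writes * tech.write_energy_nj
    # Unit conversion: 1 mW * 1 ns = 1e-12 J = 1e-3 nJ
    leakage_nj = tech.leakage_mw * delay_ns * 1e-3
    energy_nj = dynamic_nj + leakage_nj
    return energy_nj, delay_ns, energy_nj * delay_ns


# Hypothetical LLC read/write counts standing in for one profiled DL workload.
reads, writes = 1_000_000, 250_000
sram = CacheTech("SRAM", 0.5, 0.5, 2.0, 2.0, 500.0)
stt = CacheTech("STT-MRAM", 0.4, 1.0, 2.5, 6.0, 100.0)

_, _, edp_sram = cache_edp(sram, reads, writes)
_, _, edp_stt = cache_edp(stt, reads, writes)
print(f"EDP reduction of {stt.name} vs {sram.name}: {edp_sram / edp_stt:.2f}x")
```

In this toy setup the NVM's slower, costlier writes are offset by its lower leakage; whether that holds for a real workload depends on its read/write mix, which is why the profiled access counts matter.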

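Experimental Results

Research Questions

RQ1: For deep learning workloads, how do STT-MRAM and SOT-MRAM compare with SRAM in energy-delay product (EDP) and area under iso-capacity conditions?
RQ2: With a fixed cache area (iso-area), what performance and energy-efficiency benefits does NVM offer over SRAM?
RQ3: How do NVMs scale in EDP and capacity as cache size grows, particularly for large-scale deep learning inference and training?
RQ4: How does combining circuit-level NVM models with real GPU memory behavior affect the accuracy of PPA estimates?
RQ5: Given their energy and area savings, how much potential do NVM-based caches offer for accommodating additional on-chip resources, such as more processing units or larger caches?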

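Key Findings

Under iso-capacity conditions, STT-MRAM and SOT-MRAM achieve up to 3.8× and 4.7× EDP reduction over SRAM, respectively.
At the same cache capacity, STT-MRAM and SOT-MRAM reduce area by 2.4× and 2.8×, respectively.
Under iso-area assumptions, STT-MRAM and SOT-MRAM achieve up to 2× and 2.3× EDP reduction over SRAM, respectively.
Within the same area budget, SOT-MRAM accommodates up to 3.3× the cache capacity of SRAM, and STT-MRAM up to 2.3×.
For large cache capacities, STT-MRAM and SOT-MRAM reduce EDP by orders of magnitude compared with SRAM, demonstrating superior scalability.
The energy and area savings of NVM can be used to add more on-chip resources, such as processing units or larger caches, enabling new functionality.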
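
The iso-area capacity results above answer one question: how much NVM capacity fits in the area of the SRAM baseline? A minimal Python sketch of that search, assuming a toy linear area model (fixed periphery plus array area per MB); `AreaModel`, `iso_area_capacity`, and all numbers are hypothetical and do not reproduce the paper's circuit-level area estimates:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AreaModel:
    """Toy linear area model: fixed periphery plus array area per MB (hypothetical numbers)."""
    name: str
    periphery_mm2: float
    array_mm2_per_mb: float

    def area_mm2(self, capacity_mb: float) -> float:
        return self.periphery_mm2 + self.array_mm2_per_mb * capacity_mb


def iso_area_capacity(tech: AreaModel, budget_mm2: float, candidates_mb: list[float]) -> float:
    """Largest candidate capacity whose area fits within the budget (0.0 if none fits)."""
    fitting = [c for c in candidates_mb if tech.area_mm2(c) <= budget_mm2]
    return max(fitting, default=0.0)


sram = AreaModel("SRAM", periphery_mm2=1.0, array_mm2_per_mb=2.0)
nvm = AreaModel("NVM", periphery_mm2=1.0, array_mm2_per_mb=0.75)

baseline_mb = 4.0
budget = sram.area_mm2(baseline_mb)            # area occupied by the SRAM baseline
candidates = [0.5 * k for k in range(1, 65)]   # 0.5 MB steps up to 32 MB
nvm_mb = iso_area_capacity(nvm, budget, candidates)
print(f"{nvm.name} fits {nvm_mb} MB in {budget} mm^2 "
      f"({nvm_mb / baseline_mb:.2f}x the SRAM capacity)")
```

Real cache area does not scale linearly with capacity, so a full framework would query a circuit-level model for each candidate size in place of `area_mm2`; the search over candidate capacities stays the same, and the resulting capacities can then be fed into an EDP comparison for the scalability analysis.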

This summary was generated by AI and reviewed by human editors.