QUICK REVIEW

[Paper Review] Near-Data Processing for Machine Learning

Hyeokjun Choe, Se-Il Lee|arXiv (Cornell University)|Apr 24, 2017

Advanced Data Storage Technologies | References: 12 | Cited by: 8

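One-Sentence Summary

This paper presents ISP-ML, a near-data processing (NDP) simulation platform that accelerates machine learning workloads by running stochastic gradient descent (SGD) directly inside a solid-state drive (SSD) and exploiting its multi-channel parallelism. The evaluation shows that in-storage processing improves on conventional in-host processing in both performance and energy efficiency, demonstrating that in-storage processing is feasible for machine learning workloads.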

ABSTRACT

In computer architecture, near-data processing (NDP) refers to augmenting the memory or the storage with processing power so that it can process the data stored therein. By offloading the computational burden of CPU and saving the need for transferring raw data in its entirety, NDP exhibits a great potential for acceleration and power reduction. Despite this potential, specific research activities on NDP have witnessed only limited success until recently, often owing to performance mismatches between logic and memory process technologies that put a limit on the processing capability of memory. Recently, there have been two major changes in the game, igniting the resurgence of NDP with renewed interest. The first is the success of machine learning (ML), which often demands a great deal of computation for training, requiring frequent transfers of big data. The second is the advent of NAND flash-based solid-state drives (SSDs) containing multicore processors that can accommodate extra computation for data processing. Sparked by these application needs and technological support, we evaluate the potential of NDP for ML using a new SSD platform that allows us to simulate in-storage processing (ISP) of ML workloads. Our platform (named ISP-ML) is a full-fledged simulator of a realistic multi-channel SSD that can execute various ML algorithms using the data stored in the SSD. For thorough performance analysis and in-depth comparison with alternatives, we focus on a specific algorithm: stochastic gradient descent (SGD), which is the de facto standard for training differentiable learning machines including deep neural networks. We implement and compare three variants of SGD (synchronous, Downpour, and elastic averaging) using ISP-ML, exploiting the multiple NAND channels for parallelizing SGD. In addition, we compare the performance of ISP and that of conventional in-host processing, revealing the advantages of ISP.
Based on the advantages and limitations identified through our experiments, we further discuss directions for future research on ISP for accelerating ML.

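Research Motivation and Objectives

Address the growing computation and data-movement overhead of training machine learning models, deep neural networks in particular.
Overcome the limitations of conventional CPU-based processing by performing computation inside SSDs that have their own processing capability.
Design and evaluate a full-fledged simulator (ISP-ML) that models a realistic multi-channel SSD capable of executing machine learning algorithms directly on the storage device.
Study the performance and scalability of three SGD variants (synchronous, Downpour, and elastic averaging) under in-storage processing.
Compare in-storage processing (ISP) with conventional in-host processing to quantify the advantages and limitations of NDP for machine learning workloads.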

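Proposed Method

Developed ISP-ML, a full-fledged simulator that models a realistic multi-channel SSD with an embedded multicore processor capable of executing machine learning algorithms.
Implemented three stochastic gradient descent (SGD) variants (synchronous, Downpour, and elastic averaging) on the SSD's processing units to support parallel training.
Exploited the inherent parallelism of the SSD's multiple NAND channels by distributing SGD computation across channels, which speeds up computation and reduces data movement.
Simulated end-to-end machine learning training workloads directly on the SSD, bypassing the host CPU for computation and minimizing data transfer.
Compared ISP with conventional in-host processing in performance and energy efficiency, using the same machine learning algorithms.
Used the simulator to analyze scalability, communication overhead, and resource utilization across SGD variants and SSD configurations.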
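
To make the three SGD variants concrete, the following is a minimal Python sketch, not the ISP-ML implementation. It assumes a toy one-parameter least-squares problem, treats each NAND channel as a worker holding its own data shard, and simulates the workers sequentially in a round-robin loop, so the real asynchrony of Downpour and all SSD hardware behavior are left out. The names `make_shards`, `grad`, `period`, and `alpha`, and all default values, are illustrative assumptions.

```python
import random


def make_shards(num_channels, samples_per_channel, true_w=3.0, seed=0):
    """Build one toy data shard per (hypothetical) NAND channel: y = true_w * x."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_channels):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(samples_per_channel)]
        shards.append([(x, true_w * x) for x in xs])
    return shards


def grad(w, sample):
    """Gradient of 0.5 * (w*x - y)^2 with respect to w."""
    x, y = sample
    return (w * x - y) * x


def synchronous_sgd(shards, lr=0.1, steps=400, seed=1):
    """All channels use the same weight; gradients are averaged every step."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grads = [grad(w, rng.choice(shard)) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w


def downpour_sgd(shards, lr=0.1, steps=400, period=4, seed=1):
    """Each channel trains a local copy, then pushes its accumulated
    gradient to a shared server weight and pulls the latest value
    every `period` steps (simulated sequentially, not truly async)."""
    rng = random.Random(seed)
    server_w = 0.0
    local_w = [0.0] * len(shards)
    pending = [0.0] * len(shards)  # gradients accumulated since last push
    for t in range(1, steps + 1):
        for c, shard in enumerate(shards):
            g = grad(local_w[c], rng.choice(shard))
            local_w[c] -= lr * g
            pending[c] += g
            if t % period == 0:
                server_w -= lr * pending[c]  # push
                pending[c] = 0.0
                local_w[c] = server_w        # pull
    return server_w


def easgd(shards, lr=0.1, alpha=0.1, steps=400, seed=1):
    """Elastic averaging: each local weight is pulled toward a center
    weight, and the center moves toward the local weights."""
    rng = random.Random(seed)
    center = 0.0
    local_w = [0.0] * len(shards)
    for _ in range(steps):
        diffs = [w - center for w in local_w]  # elastic differences
        for c, shard in enumerate(shards):
            g = grad(local_w[c], rng.choice(shard))
            local_w[c] -= lr * g + alpha * diffs[c]
        center += alpha * sum(diffs)
    return center


if __name__ == "__main__":
    shards = make_shards(num_channels=4, samples_per_channel=64)
    for fn in (synchronous_sgd, downpour_sgd, easgd):
        print(fn.__name__, round(fn(shards), 4))
```

The three functions differ only in how often, and in what form, the per-channel workers exchange state. That exchange is the communication overhead the simulator is used to analyze.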

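Experimental Results

Research Questions

RQ1: How much can in-storage processing (ISP) accelerate stochastic gradient descent (SGD) training compared with conventional in-host processing?
RQ2: How do the different SGD variants (synchronous, Downpour, elastic averaging) perform when executed in parallel across multiple SSD channels?
RQ3: What are the performance and energy-efficiency trade-offs of offloading machine learning computation to SSD-based processing units rather than the CPU?
RQ4: What are the key bottlenecks and limiting factors of near-data processing for machine learning workloads on current SSD architectures?
RQ5: How does the architecture of modern SSDs, in particular their multicore processors and multi-channel memory, help or hinder in-storage machine learning processing?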

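Key Findings

By executing SGD directly inside the SSD, ISP substantially reduces data movement, giving lower latency and higher throughput than in-host processing.
The multi-channel architecture of modern SSDs parallelizes SGD computation effectively, raising training throughput through channel-level concurrency.
In the ISP setting, elastic averaging SGD shows better convergence stability and scalability than the synchronous and Downpour variants.
ISP's performance gains are largest for data-intensive machine learning workloads, where memory bandwidth and data-transfer overhead dominate execution time.
Despite these advantages, the limited processing power and memory bandwidth inside the SSD cap the achievable speedup, especially for highly compute-intensive models.
Energy efficiency improves because data movement is reduced and in-device processing resources are used more efficiently, but the size of the gain depends on workload characteristics and SSD hardware capabilities.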
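
The trade-off between data-intensive and compute-intensive workloads can be illustrated with a back-of-the-envelope timing model. This is a sketch under strong assumptions and is not taken from the paper: transfer and compute are assumed not to overlap, the host reads data over a single link, the in-storage processor reads over all channels in parallel, and every number below is a made-up placeholder in arbitrary units.

```python
def host_time(data, work, host_link_bw, host_rate):
    """In-host processing: move all data over the host link, then compute."""
    return data / host_link_bw + work / host_rate


def isp_time(data, work, channel_bw, num_channels, ssd_rate):
    """In-storage processing: read over all channels in parallel,
    then compute on the slower embedded processor."""
    return data / (channel_bw * num_channels) + work / ssd_rate


def isp_speedup(data, work, host_link_bw=1.0, host_rate=100.0,
                channel_bw=0.5, num_channels=8, ssd_rate=10.0):
    """Speedup of ISP over in-host processing (values above 1 favor ISP).
    All default parameters are hypothetical, not measured values."""
    return (host_time(data, work, host_link_bw, host_rate)
            / isp_time(data, work, channel_bw, num_channels, ssd_rate))


if __name__ == "__main__":
    # Data-heavy workload: lots of bytes, little arithmetic per byte.
    print("data-intensive   :", isp_speedup(data=100.0, work=10.0))
    # Compute-heavy workload: few bytes, lots of arithmetic per byte.
    print("compute-intensive:", isp_speedup(data=1.0, work=1000.0))
```

Under these placeholder parameters the data-intensive case favors ISP and the compute-intensive case favors the host, which matches the qualitative pattern in the findings above. The actual crossover point depends on the real hardware.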

This review was generated by AI and reviewed by a human editor.