QUICK REVIEW

[Paper Review] SCALE-Sim: Systolic CNN Accelerator Simulator

Ananda Samajdar, Yuhao Zhu | arXiv (Cornell University) | Oct 16, 2018

Energy Harvesting in Wireless Networks | References: 33 | Cited by: 194

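ONE-SENTENCE SUMMARY

SCALE-Sim is an open-source, cycle-accurate simulator for systolic-array CNN accelerators that lets designers explore how dataflow, array shape, memory size, and system integration affect performance and energy efficiency.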

ABSTRACT

Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools that provide insights into both the design trade-offs and efficient mapping strategies for systolic-array based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE-Sim), which is a configurable systolic array based cycle accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. This is the first systolic-array simulator tuned for running DNNs to the best of our knowledge. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, data flow and aspect ratio on the overall runtime and energy of Deep Learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners.

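MOTIVATION AND GOALS

Identify the key design parameters of systolic-array CNN accelerators and how they interact.
Provide a cycle-accurate, open-source tool for fast design space exploration.
Demonstrate how dataflow, memory size, array shape, and system integration affect performance and energy under CNN workloads.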

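PROPOSED METHOD

Model the compute as a 2D systolic array of MAC units that can perform matrix-matrix, matrix-vector, and vector-vector operations.
Support three dataflows (output stationary, weight stationary, input stationary) and capture their effect on data reuse and bandwidth.
Implement a parameterizable on-chip memory hierarchy with three logical partitions (IFMAP, filter, OFMAP), each double-buffered to hide latency.
Handle system integration by modeling the accelerator as a slave device attached to the host processor, generating SRAM/DRAM traffic so that DRAM bandwidth can be estimated.
Generate cycle-by-cycle traffic traces and summary metrics from a per-layer topology CSV and an architecture configuration; the simulator is validated against RTL for the OS dataflow.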
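
The output-stationary (OS) dataflow named above can be illustrated with a small register-level sketch. This is not SCALE-Sim's code: the function name `simulate_os_array`, the assumption that the array has exactly one PE per output element (M x N), and the choice to ignore the time needed to drain results are all simplifications made for illustration.

```python
def simulate_os_array(a, b):
    """Multiply a (M x K) by b (K x N) on a hypothetical M x N
    output-stationary systolic array, one cycle at a time.

    Rows of `a` enter from the left edge and columns of `b` enter from
    the top edge, each skewed by one cycle per row/column. Every PE keeps
    its own partial sum (the "stationary" output) and forwards operands
    to its right and lower neighbours.
    """
    m, k_dim, n = len(a), len(a[0]), len(b[0])
    a_reg = [[0] * n for _ in range(m)]  # operand moving left to right
    b_reg = [[0] * n for _ in range(m)]  # operand moving top to bottom
    acc = [[0] * n for _ in range(m)]    # per-PE accumulator

    # The last PE (m-1, n-1) sees its final operand pair at cycle
    # (k_dim - 1) + (m - 1) + (n - 1); result drain time is ignored here.
    total_cycles = k_dim + m + n - 2

    for t in range(total_cycles):
        next_a = [[0] * n for _ in range(m)]
        next_b = [[0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if j == 0:
                    k = t - i  # row i is delayed by i cycles
                    next_a[i][j] = a[i][k] if 0 <= k < k_dim else 0
                else:
                    next_a[i][j] = a_reg[i][j - 1]
                if i == 0:
                    k = t - j  # column j is delayed by j cycles
                    next_b[i][j] = b[k][j] if 0 <= k < k_dim else 0
                else:
                    next_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = next_a, next_b
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, total_cycles
```

Calling `simulate_os_array([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns the product `[[19, 22], [43, 50]]` together with a cycle count of 4. A real simulator would additionally fold larger operands onto a fixed-size array and record the SRAM and DRAM addresses touched in each cycle.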

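EXPERIMENTAL RESULTS

Research Questions

RQ1: How does the choice of dataflow (OS/WS/IS) interact with array size and workload hyperparameters to affect the performance and energy of a systolic CNN accelerator?
RQ2: What memory sizes (scratchpad limits) are needed for stall-free operation and favorable energy behavior under CNN workloads?
RQ3: How does array shape (aspect ratio) affect performance on common DNN workloads under different dataflows?
RQ4: Under a fixed compute budget, what are the trade-offs between scaling up (a larger array) and scaling out (more arrays)?
RQ5: Can a single dataflow be used effectively across different network topologies, or is dataflow customization essential for efficiency?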

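Key Findings

Across the workloads studied, the OS dataflow generally delivers the best performance, subject to the hardware cost of a stall-free implementation.
IS and WS may need fewer SRAM banks on square arrays, and their relative behavior varies with workload and array size; smaller arrays may favor IS.
Larger on-chip scratchpads reduce off-chip bandwidth demand and energy, with diminishing returns past a workload-dependent knee.
Array shape and dataflow interact in complex ways; for some networks, elongated arrays perform poorly unless the dataflow is adjusted, while square arrays generally perform well overall.
Scaling up and scaling out affect DRAM bandwidth and performance differently depending on workload and dataflow, which highlights the non-trivial trade-offs in accelerator scaling.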
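
To get a feel for why array shape and scale-up versus scale-out matter, the findings above can be explored with a first-order runtime estimate. This is a simplified sketch, not the paper's model: it assumes an OS dataflow, no memory stalls, no padding, and that a convolution layer is lowered to a matrix multiply in the usual im2col fashion. The function names and the column-wise partitioning used for scale-out are hypothetical.

```python
from math import ceil


def conv_to_gemm(ifmap_h, ifmap_w, filt_h, filt_w, channels, num_filters, stride=1):
    """Map a convolution layer (no padding assumed) to GEMM dimensions (M, K, N)."""
    out_h = (ifmap_h - filt_h) // stride + 1
    out_w = (ifmap_w - filt_w) // stride + 1
    m = out_h * out_w                # one row per output pixel
    k = filt_h * filt_w * channels   # length of each dot product
    n = num_filters                  # one column per filter
    return m, k, n


def os_cycles(m, k, n, rows, cols):
    """Stall-free cycle estimate for an OS dataflow on a rows x cols array.

    Assumption: the output is tiled into folds of rows x cols elements,
    and each fold costs k cycles of streaming plus rows + cols - 2 cycles
    of skew, with no overlap between folds.
    """
    folds = ceil(m / rows) * ceil(n / cols)
    return folds * (k + rows + cols - 2)


def scale_out_cycles(m, k, n, rows, cols, num_arrays):
    """Estimate for num_arrays identical arrays that split the filters evenly
    (a hypothetical partitioning) and run in parallel."""
    return os_cycles(m, k, ceil(n / num_arrays), rows, cols)


# Example layer: 8x8 input, 3x3 filters, 4 channels, 16 filters.
m, k, n = conv_to_gemm(8, 8, 3, 3, 4, 16)
square = os_cycles(m, k, n, 8, 8)                   # 64 PEs, square
wide = os_cycles(m, k, n, 2, 32)                    # 64 PEs, short and wide
tall = os_cycles(m, k, n, 32, 2)                    # 64 PEs, tall and narrow
four_small = scale_out_cycles(m, k, n, 4, 4, 4)     # 64 PEs as four 4x4 arrays
```

For this one toy layer the estimate gives 500 cycles for the square array, 1224 and 1088 for the two elongated shapes, and 378 for the four small arrays. Because the model ignores bandwidth and memory limits, it cannot show the DRAM-side cost of scaling out, which is exactly the kind of effect a cycle-accurate simulator is meant to expose.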

This review was generated by AI and checked by a human editor.