[Paper Explained] Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks
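Eyeriss v2 is a flexible, high-performance deep neural network (DNN) accelerator. It introduces the Row-Stationary Plus (RS+) dataflow and a hierarchical mesh network-on-chip (NoC) to efficiently process diverse DNN workloads with differing data reuse and bandwidth requirements. With 256 processing elements (PEs) it is 10.4x to 17.9x faster than Eyeriss, and with 16384 PEs the speedup reaches up to 1086.7x, showing strong scalability and adaptability across a wide range of DNNs.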
The design of DNNs has increasingly focused on reducing the computational complexity in addition to improving accuracy. While emerging DNNs tend to have fewer weights and operations, they also reduce the amount of data reuse with more widely varying layer shapes and sizes. This leads to a diverse set of DNNs, ranging from large ones with high reuse (e.g., AlexNet) to compact ones with high bandwidth requirements (e.g., MobileNet). However, many existing DNN processors depend on certain DNN properties, e.g., a large number of channels, to achieve high performance and energy efficiency and do not have sufficient flexibility to efficiently process a diverse set of DNNs. In this work, we present Eyexam, a performance analysis framework that quantitatively identifies the sources of performance loss in DNN processors. It highlights two architectural bottlenecks in many existing designs. First, their dataflows are not flexible enough to adapt to the varying layer shapes and sizes of different DNNs. Second, their network-on-chip (NoC) can't adapt to support both high data reuse and high bandwidth scenarios. Based on this analysis, we present Eyeriss v2, a high-performance DNN accelerator that adapts to a wide range of DNNs. Eyeriss v2 has a new dataflow, called Row-Stationary Plus (RS+), that enables the spatial tiling of data from all dimensions to fully utilize the parallelism for high performance. To support RS+, it has a low-cost and scalable NoC design, called hierarchical mesh, that connects the high-bandwidth global buffer to the array of processing elements (PEs) in a two-level hierarchy. This enables high-bandwidth data delivery while still being able to harness any available data reuse. Compared with Eyeriss, Eyeriss v2 has a performance increase of 10.4x-17.9x for 256 PEs, 37.7x-71.5x for 1024 PEs, and 448.8x-1086.7x for 16384 PEs on DNNs with widely varying amounts of data reuse.
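Research Motivation and Objectives
- Address the limited performance of existing DNN accelerators on emerging DNNs with highly variable layer shapes and data reuse patterns.
- Identify the architectural bottlenecks in current DNN processors, in particular inflexible dataflows and non-adaptive networks-on-chip (NoCs).
- Design a new accelerator that efficiently supports both high-data-reuse and high-bandwidth workloads.
- Achieve scalable, high-performance inference across a broad range of DNN models, from compact to large architectures.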
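Proposed Method
- Propose Eyexam, a performance analysis framework that quantitatively identifies performance bottlenecks in DNN processors.
- Introduce the Row-Stationary Plus (RS+) dataflow, which spatially tiles data from all dimensions to maximize parallelism and fully exploit data reuse.
- Design a hierarchical mesh NoC that connects the global buffer to the processing elements (PEs) through a two-level hierarchy, providing scalable, low-cost bandwidth delivery.
- Use the two-level NoC architecture to support both high-bandwidth and high-reuse scenarios without sacrificing performance.
- Optimize data movement by aligning the RS+ dataflow with the hierarchical NoC, minimizing accesses to external memory.
- Adapt dynamically to diverse DNN workloads by combining a flexible tiling mechanism with a scalable interconnect.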
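To make the motivation for tiling across all dimensions concrete, here is a minimal Python sketch of a toy utilization model. It is not Eyexam and not the paper's analysis; the dimension names (`G`, `M`, `C`, `E`, `R`), the layer shapes, and the function `spatial_utilization` are all hypothetical, and the model ignores fragmentation, bandwidth limits, and buffer sizes. It only shows why a dataflow that depends on a large number of channels can leave most PEs idle on a layer with few channels per group.

```python
from math import prod

def spatial_utilization(dims, tiled, num_pes):
    """Upper bound on the fraction of PEs kept busy.

    dims: mapping of dimension name -> size for one layer (hypothetical names).
    tiled: the dimensions the dataflow is allowed to spread across the PE array.
    num_pes: number of processing elements in the array.
    """
    # Parallelism available to the array is the product of the tiled dimensions.
    parallelism = prod(dims[d] for d in tiled)
    return min(parallelism, num_pes) / num_pes

# Hypothetical shapes: groups G, output channels M, input channels C,
# output rows E, filter rows R.
dense_layer = {"G": 1, "M": 256, "C": 128, "E": 13, "R": 3}
depthwise_layer = {"G": 32, "M": 1, "C": 1, "E": 56, "R": 3}

channel_only = ("M", "C")               # assumed channel-dependent dataflow
all_dims = ("G", "M", "C", "E", "R")    # assumed "tile everything" dataflow

for name, layer in (("dense", dense_layer), ("depthwise", depthwise_layer)):
    print(name,
          spatial_utilization(layer, channel_only, 256),
          spatial_utilization(layer, all_dims, 256))
```

Under these assumed shapes, the channel-only mapping fills the array on the dense layer but can use only one of 256 PEs on the depthwise layer, while tiling over every dimension exposes enough parallelism to fill the array in both cases.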
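Experimental Results
Research Questions
- RQ1: Which architectural bottlenecks limit the performance of existing DNN accelerators when they process diverse DNN workloads?
- RQ2: How can a DNN accelerator efficiently support both high-data-reuse and high-bandwidth workloads?
- RQ3: Can a flexible dataflow combined with a scalable NoC design deliver high performance across a wide range of DNN models?
- RQ4: To what extent do a reconfigurable dataflow and a hierarchical NoC improve performance and energy efficiency?
- RQ5: How does the performance of Eyeriss v2 scale across different DNNs as the number of processing elements (PEs) increases?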
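Key Findings
- With 256 PEs, Eyeriss v2 achieves a 10.4x to 17.9x performance improvement over Eyeriss across a variety of DNNs.
- With 1024 PEs, Eyeriss v2 achieves a 37.7x to 71.5x performance improvement over Eyeriss.
- With 16384 PEs, Eyeriss v2 achieves a 448.8x to 1086.7x performance gain over Eyeriss, showing strong scalability.
- The Row-Stationary Plus (RS+) dataflow exploits parallelism across all data dimensions, improving resource utilization.
- The hierarchical mesh NoC effectively supports both high-bandwidth and high-reuse workloads without sacrificing scalability.
- The Eyexam analysis shows that inflexible dataflows and non-adaptive NoCs are the main performance bottlenecks in existing DNN accelerators.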
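The reported speedup ranges can be compared against the growth in PE count with a few lines of arithmetic. The sketch below is only a reading aid: the helper `growth_vs_base` is hypothetical, the text above does not say whether the Eyeriss baseline is held fixed or scaled to the same PE count, and the low and high ends of each range may belong to different DNNs. The ratios therefore compare the reported numbers only and are not per-network scaling efficiencies.

```python
# Speedup ranges over Eyeriss as reported in the abstract: PE count -> (low, high).
reported = {256: (10.4, 17.9), 1024: (37.7, 71.5), 16384: (448.8, 1086.7)}

def growth_vs_base(reported, base=256):
    """Ratio of each reported range endpoint to the endpoint at `base` PEs."""
    base_lo, base_hi = reported[base]
    out = {}
    for pes, (lo, hi) in reported.items():
        out[pes] = {
            "pe_ratio": pes / base,        # how many times more PEs
            "low_ratio": lo / base_lo,     # growth of the low end of the range
            "high_ratio": hi / base_hi,    # growth of the high end of the range
        }
    return out

for pes, row in growth_vs_base(reported).items():
    print(pes, {k: round(v, 2) for k, v in row.items()})
```

Going from 256 to 1024 PEs (4x the PEs), the reported range endpoints grow by roughly 3.6x to 4.0x; going from 256 to 16384 PEs (64x the PEs), they grow by roughly 43x to 61x.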
This summary was generated by AI and reviewed by human editors.