QUICK REVIEW

[Paper Review] Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks

Yu‐Hsin Chen, Joel Emer|arXiv (Cornell University)|2018. 07. 10.

Advanced Neural Network Applications | References: 16 | Citations: 65

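One-Line Summary

Eyeriss v2 is a flexible, high-performance DNN accelerator designed to efficiently handle the varying data reuse and bandwidth requirements of diverse DNN workloads. By introducing the Row-Stationary Plus (RS+) dataflow and a hierarchical mesh NoC, it achieves 10.4x–17.9x higher performance than Eyeriss with 256 PEs and up to 1086.7x higher performance with 16384 PEs, demonstrating strong scalability and adaptability across a range of DNNs.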

ABSTRACT

The design of DNNs has increasingly focused on reducing the computational complexity in addition to improving accuracy. While emerging DNNs tend to have fewer weights and operations, they also reduce the amount of data reuse with more widely varying layer shapes and sizes. This leads to a diverse set of DNNs, ranging from large ones with high reuse (e.g., AlexNet) to compact ones with high bandwidth requirements (e.g., MobileNet). However, many existing DNN processors depend on certain DNN properties, e.g., a large number of channels, to achieve high performance and energy efficiency and do not have sufficient flexibility to efficiently process a diverse set of DNNs. In this work, we present Eyexam, a performance analysis framework that quantitatively identifies the sources of performance loss in DNN processors. It highlights two architectural bottlenecks in many existing designs. First, their dataflows are not flexible enough to adapt to the varying layer shapes and sizes of different DNNs. Second, their network-on-chip (NoC) can't adapt to support both high data reuse and high bandwidth scenarios. Based on this analysis, we present Eyeriss v2, a high-performance DNN accelerator that adapts to a wide range of DNNs. Eyeriss v2 has a new dataflow, called Row-Stationary Plus (RS+), that enables the spatial tiling of data from all dimensions to fully utilize the parallelism for high performance. To support RS+, it has a low-cost and scalable NoC design, called hierarchical mesh, that connects the high-bandwidth global buffer to the array of processing elements (PEs) in a two-level hierarchy. This enables high-bandwidth data delivery while still being able to harness any available data reuse. Compared with Eyeriss, Eyeriss v2 has a performance increase of 10.4x-17.9x for 256 PEs, 37.7x-71.5x for 1024 PEs, and 448.8x-1086.7x for 16384 PEs on DNNs with widely varying amounts of data reuse.

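Motivation and Objectives

To address the performance limitations of existing DNN accelerators, which cannot cope with emerging DNNs that have varied layer shapes and data reuse patterns.
To identify the architectural bottlenecks in current DNN processors, in particular inflexible dataflows and non-adaptive NoCs.
To design a new accelerator that efficiently supports both high-data-reuse and high-bandwidth workloads.
To enable scalable, high-performance inference across DNN models ranging from compact to large architectures.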

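Proposed Method

Proposes Eyexam, a performance analysis framework that quantitatively identifies the performance bottlenecks in DNN processors.
Introduces the Row-Stationary Plus (RS+) dataflow, which allows data to be spatially tiled along all dimensions to maximize parallelism and reuse.
Designs a hierarchical mesh network-on-chip (NoC) with a two-level hierarchy that connects the global buffer to the processing elements (PEs), providing scalable, low-cost bandwidth.
The two-level NoC architecture supports both high-bandwidth and high-reuse scenarios without degrading performance.
Aligns the RS+ dataflow with the hierarchical NoC to minimize external memory accesses and thereby optimize data movement.
Combines flexible tiling with a scalable interconnect so the design can adapt dynamically to diverse DNN workloads.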
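
As a rough illustration of why spatial tiling along more than one dimension matters, the following Python sketch estimates PE utilization for a toy layer. It is a simplified model under stated assumptions, not the RS+ mapping from the paper: the function name pe_utilization, the two-dimensional layer (channels by output width), and all sizes are hypothetical.

```python
from math import ceil, prod


def pe_utilization(dim_sizes, tile_sizes):
    """Fraction of PE work slots that do useful work in a toy spatial mapping.

    dim_sizes:  size of each layer dimension (hypothetical, e.g. channels, width)
    tile_sizes: number of PEs assigned along each dimension; their product
                is the total PE count.
    A dimension of size d mapped onto t PEs needs ceil(d / t) passes, so it
    occupies ceil(d / t) * t slots, of which only d are useful.
    """
    useful = prod(dim_sizes)
    occupied = prod(ceil(d / t) * t for d, t in zip(dim_sizes, tile_sizes))
    return useful / occupied


# Hypothetical layer: 32 channels, output width 56, on a 256-PE array.
layer = (32, 56)

# Rigid mapping: all 256 PEs spread across channels only.
print(pe_utilization(layer, (256, 1)))

# Flexible mapping: 32 PEs across channels, 8 across output width.
print(pe_utilization(layer, (32, 8)))
```

In this toy case, the channel-only mapping leaves most PEs idle because the layer has far fewer channels than the array has PEs, while tiling over both dimensions keeps every PE busy. Real mappings must also account for buffer capacity and data delivery, which this sketch ignores.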

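Experimental Results

Research Questions

RQ1: What architectural bottlenecks limit the performance of existing DNN accelerators when processing diverse DNN workloads?
RQ2: How can a DNN accelerator efficiently support both high-data-reuse and high-bandwidth workloads?
RQ3: Can a flexible dataflow and a scalable NoC design achieve high performance across diverse DNN models?
RQ4: How much can a reconfigurable dataflow and a hierarchical NoC contribute to performance and energy efficiency?
RQ5: How does the performance of Eyeriss v2 scale across diverse DNNs as the number of PEs increases?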

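Key Results

With 256 PEs, Eyeriss v2 achieves 10.4x–17.9x higher performance than Eyeriss across diverse DNNs.
With 1024 PEs, Eyeriss v2 delivers 37.7x–71.5x higher performance than Eyeriss.
With 16384 PEs, Eyeriss v2 achieves 448.8x–1086.7x higher performance than Eyeriss, demonstrating strong scalability.
The Row-Stationary Plus (RS+) dataflow makes it possible to fully exploit parallelism across all data dimensions, improving resource utilization.
The hierarchical mesh NoC effectively supports both high-bandwidth and high-reuse workloads without compromising scalability.
The Eyexam analysis identifies inflexible dataflows and non-adaptive NoCs as the key causes of performance bottlenecks in existing DNN accelerators.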
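
To make the NoC bottleneck concrete, the sketch below applies a generic roofline-style bound: attainable throughput is the smaller of the compute peak and the delivered bandwidth multiplied by data reuse. This is a standard back-of-the-envelope model, not Eyexam itself, and the function name attainable_macs_per_cycle, the one-MAC-per-PE-per-cycle assumption, and all numbers are hypothetical.

```python
def attainable_macs_per_cycle(num_pes, words_per_cycle, macs_per_word):
    """Roofline-style upper bound on throughput, in MACs per cycle.

    num_pes:         compute peak, assuming one MAC per PE per cycle
    words_per_cycle: data words the NoC can deliver per cycle (assumed)
    macs_per_word:   data reuse, i.e. MACs performed per delivered word
    """
    compute_bound = num_pes
    bandwidth_bound = words_per_cycle * macs_per_word
    return min(compute_bound, bandwidth_bound)


# Hypothetical 256-PE array fed at 16 words per cycle.
# High-reuse layer (64 MACs per word): limited by compute.
print(attainable_macs_per_cycle(256, 16, 64))  # 256

# Low-reuse layer (2 MACs per word): limited by bandwidth.
print(attainable_macs_per_cycle(256, 16, 2))   # 32
```

Under these assumed numbers, the low-reuse layer can keep only 32 of the 256 PEs busy no matter how it is mapped, which is the kind of case where a NoC that cannot raise its delivered bandwidth becomes the limiting factor.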


This review was generated by AI and reviewed by a human editor.