QUICK REVIEW

[Paper Review] SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks

Angshuman Parashar, Minsoo Rhu | arXiv (Cornell University) | May 23, 2017

Advanced Neural Network Applications | 17 references | 124 citations

TL;DR

SCNN is a CNN inference accelerator that keeps weights and activations in compressed-sparse encodings and uses a PT-IS-CP-sparse dataflow to achieve about 2.7x higher performance and 2.3x better energy efficiency than a comparably provisioned dense accelerator. The evaluated configuration has 64 PEs with 1024 multipliers in total and emphasizes on-chip data reuse and skipping computation on zero operands.

ABSTRACT

Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and zero-valued activations that arise from the common ReLU operator applied during inference. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to the multiplier array, where they are extensively reused. In addition, the accumulation of multiplication products is performed in a novel accumulator array. Our results show that on contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7x and 2.3x, respectively, over a comparably provisioned dense CNN accelerator.

Motivation & Objective

Enable efficient CNN inference by exploiting the weight sparsity produced by pruning and the activation sparsity produced by ReLU.
Develop a dataflow and hardware architecture that keeps sparse data compressed and reused on-chip.
Minimize data movement and avoid multiplications by zero operands.
Evaluate a sparse CNN accelerator against dense counterparts in terms of throughput, energy, and area.
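
To make the compression and zero-skipping goals concrete, the sketch below stores a vector as its nonzero values plus the number of zeros preceding each one, then counts how many multiplications remain when only nonzero operands are paired. The function names, the example data, and the zero-run encoding itself are assumptions chosen for illustration; this review does not specify the exact format SCNN uses.

```python
from typing import List, Tuple


def compress(dense: List[float]) -> Tuple[List[float], List[int]]:
    """Return (nonzero values, zeros preceding each nonzero value).

    Hypothetical encoding for illustration only.
    """
    values, zero_runs = [], []
    run = 0
    for x in dense:
        if x == 0:
            run += 1
        else:
            values.append(x)
            zero_runs.append(run)
            run = 0
    return values, zero_runs


def decompress(values: List[float], zero_runs: List[int], length: int) -> List[float]:
    """Rebuild the dense vector; `length` restores any trailing zeros."""
    dense = [0.0] * length
    pos = -1
    for v, run in zip(values, zero_runs):
        pos += run + 1  # skip the zeros, then land on the nonzero slot
        dense[pos] = v
    return dense


# Made-up example data: a ReLU-style activation vector and pruned weights.
activations = [0.0, 1.5, 0.0, 0.0, 2.0, 0.0]
weights = [0.5, 0.0, -1.0]

act_vals, act_runs = compress(activations)
w_vals, w_runs = compress(weights)

# All-pairs multiplication count with and without zero operands.
dense_mults = len(activations) * len(weights)
sparse_mults = len(act_vals) * len(w_vals)
print(f"stored activations: {len(act_vals)} of {len(activations)}")
print(f"multiplications: {sparse_mults} instead of {dense_mults}")
```

In this toy case both the storage and the multiplication count shrink with the product of the two densities, which is the effect the objectives above are aiming for.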

Proposed method

Introduce the PT-IS-CP-sparse (Planar Tiled, Input Stationary, Cartesian Product, sparse) dataflow, which operates on compressed-sparse blocks of weights and activations.
Use a Cartesian-product multiplier array to compute all non-zero pairwise products.
Employ a scatter accumulator network to sum partial products at correct coordinates.
Structure an on-chip memory hierarchy with IARAM/OARAM and a distributed accumulator bank array to keep data local.
Represent outputs in compressed-sparse form and apply halo handling and layer sequencing to manage tiling across PEs.
Provide both cycle-level simulation and analytical modeling (TimeLoop) to explore dense vs sparse architectures.
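
The Cartesian-product and scatter-accumulate steps listed above can be sketched in software for a single-channel, stride-1 convolution. This is a minimal model under stated assumptions, not the paper's hardware: the function names are hypothetical, there is one input and one output channel, and products whose output coordinate falls outside the tile are simply discarded here, whereas the method above handles tile edges with halos.

```python
from itertools import product


def nonzeros(grid):
    """List (row, col, value) for every nonzero entry of a 2-D list."""
    return [(i, j, v) for i, row in enumerate(grid)
            for j, v in enumerate(row) if v != 0]


def sparse_conv2d(acts, weights):
    """Multiply every nonzero activation by every nonzero weight, then
    scatter each product to the output coordinate it contributes to."""
    H, W = len(acts), len(acts[0])
    R, S = len(weights), len(weights[0])
    out_h, out_w = H - R + 1, W - S + 1
    acc = [[0.0] * out_w for _ in range(out_h)]  # accumulator array
    mults = 0
    for (x, y, a), (r, s, w) in product(nonzeros(acts), nonzeros(weights)):
        mults += 1
        ox, oy = x - r, y - s  # output coordinate for this product
        if 0 <= ox < out_h and 0 <= oy < out_w:
            acc[ox][oy] += a * w  # scatter-accumulate
    return acc, mults


def dense_conv2d(acts, weights):
    """Reference sliding-window convolution (cross-correlation form)."""
    H, W = len(acts), len(acts[0])
    R, S = len(weights), len(weights[0])
    return [[sum(acts[i + r][j + s] * weights[r][s]
                 for r in range(R) for s in range(S))
             for j in range(W - S + 1)]
            for i in range(H - R + 1)]


# Made-up sparse inputs for illustration.
acts = [[0.0, 2.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 3.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0, 1.0]]
weights = [[1.0, 0.0],
           [0.0, -1.0]]

sparse_out, mults = sparse_conv2d(acts, weights)
assert sparse_out == dense_conv2d(acts, weights)
print(f"{mults} multiplications for the sparse version")
```

Every nonzero activation pairs with every nonzero weight, so the multiplier count depends only on the number of nonzeros; the cost moves to the accumulation side, where products arrive at irregular output coordinates and must be routed to the right accumulator.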

Experimental results

Research questions

RQ1: How does exploiting weight and activation sparsity impact CNN inference performance and energy on a specialized accelerator?
RQ2: What dataflow and hardware design best utilize compressed-sparse representations for weights, inputs, and outputs?
RQ3: What are the area, speed, and energy trade-offs of a sparse CNN accelerator compared to dense designs under comparable resources?
RQ4: Can all activations fit on-chip for common networks (e.g., AlexNet, GoogLeNet) using compressed representations?
RQ5: How do halo/tile strategies influence scalability and energy efficiency in sparse CNN acceleration?

Key findings

A 64-PE SCNN configuration with 1024 multipliers achieves approximately 2 Tera-ops peak throughput.
SCNN delivers about 2.7x speedup and 2.3x energy reduction relative to a comparably provisioned dense CNN accelerator.
Activation and weight sparsity are exploited via compressed-sparse encodings and a PT-IS-CP-sparse dataflow that eliminates unnecessary multiplications.
The SCNN design uses 1 MB of on-chip activation RAM (IARAM+OARAM) and accumulates partial sums across a distributed bank array to support sparse computation.
Area for a single SCNN PE is around 0.123 mm^2, with the full 64-PE accelerator estimated at 7.9 mm^2, driven largely by memory requirements.
The architecture supports tiling and DRAM access-energy accounting through TimeLoop analyses and a cycle-level simulator for performance/power estimation.
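
The headline figures above are mutually consistent under simple arithmetic, as the short check below suggests. The clock frequency and the convention of counting a multiply-accumulate as two operations are assumptions made for this sketch; neither is stated in this review.

```python
# Figures taken from the key findings above.
NUM_PES = 64
MULTIPLIERS_TOTAL = 1024
PE_AREA_MM2 = 0.123

# Assumptions for this sketch (not stated in the review).
OPS_PER_MAC = 2   # one multiply plus one add
CLOCK_HZ = 1e9    # roughly 1 GHz

multipliers_per_pe = MULTIPLIERS_TOTAL // NUM_PES
peak_tops = MULTIPLIERS_TOTAL * OPS_PER_MAC * CLOCK_HZ / 1e12
total_area_mm2 = NUM_PES * PE_AREA_MM2

print(f"multipliers per PE: {multipliers_per_pe}")
print(f"peak throughput: {peak_tops:.3f} Tera-ops")
print(f"total PE area: {total_area_mm2:.3f} mm^2")
```

Under these assumptions the peak lands near the quoted 2 Tera-ops, and 64 PEs at 0.123 mm^2 each account for essentially all of the 7.9 mm^2 estimate.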

This review was created by AI and reviewed by human editors.