QUICK REVIEW

[Paper Review] Recurrent Neural Networks Hardware Implementation on FPGA

Andre Xian Ming Chang, Berin Martini|arXiv (Cornell University)|Nov 17, 2015

Advanced Neural Network Applications|21 references|85 citations

TL;DR

This paper presents a hardware implementation of a 2-layer Long Short-Term Memory (LSTM) network with 128 hidden units on a Xilinx Zynq 7020 FPGA, achieving over 21× speedup over the embedded ARM Cortex-A9 CPU. The design uses parallelized matrix-vector operations and AXI DMA for high-bandwidth data streaming, demonstrating real-time inference for character-level language modeling with minimal error propagation.

ABSTRACT

Recurrent Neural Networks (RNNs) have the ability to retain memory and learn data sequences. Due to the recurrent nature of RNNs, it is sometimes hard to parallelize all its computations on conventional hardware. CPUs do not currently offer large parallelism, while GPUs offer limited parallelism due to sequential components of RNN models. In this paper we present a hardware implementation of Long-Short Term Memory (LSTM) recurrent network on the programmable logic Zynq 7020 FPGA from Xilinx. We implemented a RNN with $2$ layers and $128$ hidden units in hardware and it has been tested using a character level language model. The implementation is more than $21\times$ faster than the ARM CPU embedded on the Zynq 7020 FPGA. This work can potentially evolve to a RNN co-processor for future mobile devices.

Motivation & Objective

To address the computational inefficiency of RNNs on conventional embedded processors like CPUs and GPUs due to sequential dependencies and limited parallelism.
To design a custom FPGA-based hardware accelerator for LSTM networks to enable high-performance, low-power inference on mobile and embedded systems.
To demonstrate real-time inference of a character-level language model using a hardware-implemented LSTM on a Zynq 7020 platform.
To evaluate the accuracy, latency, and energy efficiency of the FPGA implementation compared to CPU and GPU baselines.

Proposed method

Implemented a vanilla LSTM architecture without peephole connections using fixed-point arithmetic for hardware efficiency.
Designed parallelized hardware modules for the four LSTM gates (input, forget, output, candidate cell) using pipelined multipliers and adders.
Used AXI DMA interfaces to stream input sequences, weights, and hidden states between the programmable logic and external DDR3 memory, with the design clocked at 142 MHz.
Optimized memory access by using four concurrent DMA channels to achieve 3.8 GB/s peak bandwidth and reduce bottlenecks.
Mapped the LSTM computation to a Zynq 7020 SoC with a dual-core ARM Cortex-A9 processor handling control and data flow.
Validated the design using a character-level language model trained in Torch7, comparing FPGA output with CPU reference results.
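
The method above can be sketched in software as a single LSTM time step with fixed-point rounding. This is a minimal illustration, not the paper's hardware design: the 16-bit word with 8 fractional bits, the helper names (to_fixed, matvec, lstm_step), and the gate ordering in the stacked weight matrices are all assumptions made for the example, and the nonlinearities are evaluated exactly rather than with whatever approximation the hardware uses.

```python
import math

FRAC_BITS = 8    # assumption: 8 fractional bits (not stated in this review)
WORD_BITS = 16   # assumption: 16-bit signed words


def to_fixed(x, frac_bits=FRAC_BITS, word_bits=WORD_BITS):
    """Snap x to a signed fixed-point grid with saturation (simulated in floats)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale))) / scale


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(M, v):
    """Matrix-vector product written as one multiply-accumulate chain per row."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One vanilla LSTM step without peephole connections.

    W is (4H x D), U is (4H x H), b has length 4H. Rows are stacked in the
    order input gate, forget gate, output gate, candidate cell (an assumed
    layout chosen for this sketch).
    """
    H = len(h_prev)
    wx, uh = matvec(W, x), matvec(U, h_prev)
    z = [to_fixed(a + r + bias) for a, r, bias in zip(wx, uh, b)]
    i = [to_fixed(sigmoid(v)) for v in z[0:H]]           # input gate
    f = [to_fixed(sigmoid(v)) for v in z[H:2 * H]]       # forget gate
    o = [to_fixed(sigmoid(v)) for v in z[2 * H:3 * H]]   # output gate
    g = [to_fixed(math.tanh(v)) for v in z[3 * H:4 * H]] # candidate cell
    # The gates never read c_prev directly, which is what "no peepholes" means.
    c = [to_fixed(fk * ck + ik * gk) for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [to_fixed(ok * math.tanh(ck)) for ok, ck in zip(o, c)]
    return h, c
```

Calling lstm_step repeatedly, feeding each returned h and c back in as h_prev and c_prev, gives the sequential dependency that makes RNN inference hard to parallelize across time steps; the parallelism available to hardware lies inside matvec and across the four gates.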

Experimental results

Research questions

RQ1: Can an FPGA-based hardware accelerator achieve significantly higher throughput than a general-purpose CPU for LSTM inference on embedded platforms?
RQ2: How does the accuracy of the FPGA-implemented LSTM compare to software-based inference, particularly in terms of error propagation over long sequences?
RQ3: What is the impact of memory bandwidth on the scalability of parallel LSTM modules on FPGA?
RQ4: Can the hardware design maintain low latency and high throughput for small RNN models that do not benefit from GPU acceleration?
RQ5: What is the performance-per-power efficiency of the FPGA implementation compared to CPU and GPU platforms?

Key findings

The FPGA-based LSTM implementation achieved a 21.3× speedup over the ARM Cortex-A9 CPU on the same Zynq 7020 platform for a 2-layer, 128-unit character-level language model.
The average percentage error in hidden state (h_t) was 2.8%, and in cell state (c_t) was 3.9%, with no significant error accumulation over 1000 time steps.
The design's peak memory bandwidth usage was 1.236 GB/s across four AXI DMA channels; external memory bandwidth limits how many LSTM modules can run in parallel.
The FPGA implementation outperformed the GPU of a MacBook Pro 2016. On that machine the GPU run took 0.569 s against 0.304 s on the CPU, because memory copy overhead is high for small models.
Performance per unit power was significantly higher on the FPGA than on CPU or GPU, indicating strong energy efficiency for embedded inference.
The generated text from the FPGA and CPU implementations was qualitatively similar, both producing Shakespeare-like dialogues, confirming functional correctness.
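
The error and timing figures above can be reproduced in form with a few lines of arithmetic. The sketch below assumes one plausible definition of "average percentage error" (total absolute difference divided by total absolute reference value); the review does not give the paper's exact formula, and the function names are hypothetical.

```python
def avg_percentage_error(reference, measured):
    """Total absolute difference as a percentage of total absolute reference.

    Assumed definition for illustration; the paper's formula may differ.
    """
    if not reference or len(reference) != len(measured):
        raise ValueError("inputs must be non-empty and of equal length")
    abs_ref = sum(abs(r) for r in reference)
    if abs_ref == 0:
        raise ValueError("reference is all zeros; percentage error is undefined")
    abs_err = sum(abs(r - m) for r, m in zip(reference, measured))
    return 100.0 * abs_err / abs_ref


def error_per_step(ref_steps, meas_steps):
    """Error at each time step, to check whether it grows along the sequence."""
    return [avg_percentage_error(r, m) for r, m in zip(ref_steps, meas_steps)]


def time_ratio(baseline_seconds, other_seconds):
    """How many times longer the baseline run took than the other run."""
    return baseline_seconds / other_seconds


# Timings quoted above for the MacBook Pro: GPU 0.569 s, CPU 0.304 s.
gpu_vs_cpu = time_ratio(0.569, 0.304)  # GPU took roughly 1.87 times as long
```

A flat sequence from error_per_step over many time steps, when comparing a software reference against fixed-point hidden or cell states, is the kind of evidence behind the "no significant error accumulation" finding.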

This review was created by AI and reviewed by human editors.