QUICK REVIEW

[Paper Review] ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA

Song Han, Junlong Kang | arXiv (Cornell University) | Dec 1, 2016

Speech Recognition and Synthesis | Cited by 38

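One-Sentence Summary

This paper proposes ESE, an efficient speech recognition engine that compresses LSTM models with load-balance-aware pruning and quantization, shrinking the model by 20x with minimal accuracy loss. The system adds a hardware-aware scheduler and a custom hardware architecture deployed on an XCKU060 FPGA, reaching 282 GOPS at 41 W: 43x faster and 40x more energy-efficient than the CPU baseline, and 3x faster than the GPU baseline.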

ABSTRACT

Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to high total cost of ownership (TCO) of a data center. In order to speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose scheduler that encodes and partitions the compressed model to each PE for parallelism, and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the compressed model. Implemented on Xilinx XCKU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

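Research Motivation and Goals

Address the heavy computation and memory demands that large LSTM models impose on speech recognition.
Reduce model size and energy consumption for data-center deployments, where bulky models drive up total cost of ownership (TCO).
Compress the LSTM model to achieve high-throughput, low-power inference while preserving prediction accuracy.
Design a dedicated hardware architecture that processes the compressed LSTM model efficiently.
Deliver better performance and energy efficiency than CPU and GPU implementations.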

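Proposed Method

Propose a load-balance-aware pruning method that compresses the LSTM model by 10x while preserving accuracy.
Apply quantization for a further 2x compression, for a total 20x reduction in model size.
Design a scheduler that encodes the compressed model and partitions it across multiple processing elements (PEs) for parallel execution.
Develop a dataflow-aware scheduling algorithm to manage the complex sequential dependencies in the LSTM computation.
Implement ESE, a custom hardware architecture optimized for the compressed LSTM model, on a Xilinx XCKU060 FPGA.
Integrate the scheduler with the hardware pipeline to run efficient, real-time inference directly on the compressed model.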
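
The review does not spell out how load-balance-aware pruning works, so the following is only a minimal sketch under stated assumptions: rows of a weight matrix are interleaved across PEs (row i goes to PE i mod N), and each PE's share is pruned by magnitude to the same sparsity. The function names, the toy 8x8 matrix, and the 90% sparsity target are hypothetical and chosen for illustration; this is not the authors' implementation. It contrasts a single global pruning pass with a per-PE pass and counts the nonzero weights each PE would have to process.

```python
import random

def assign_rows_to_pes(num_rows, num_pes):
    """Assumed layout: row i is handled by PE (i mod num_pes)."""
    return [list(range(pe, num_rows, num_pes)) for pe in range(num_pes)]

def prune_entries(matrix, coords, sparsity):
    """Zero the smallest-magnitude `sparsity` fraction of the weights at `coords`, in place."""
    k = int(len(coords) * sparsity)
    for r, c in sorted(coords, key=lambda rc: abs(matrix[rc[0]][rc[1]]))[:k]:
        matrix[r][c] = 0.0

def nonzeros_per_pe(matrix, pe_rows):
    """Count the surviving weights each PE must multiply."""
    return [sum(1 for r in rows for w in matrix[r] if w != 0.0) for rows in pe_rows]

# Toy data (hypothetical): rows owned by higher-numbered PEs get larger weights,
# so a single global pass keeps more of their weights and skews the load.
rng = random.Random(0)
NUM_ROWS, NUM_COLS, NUM_PES, SPARSITY = 8, 8, 4, 0.9
dense = [[rng.gauss(0.0, 1.0) * (1 + r % NUM_PES) for _ in range(NUM_COLS)]
         for r in range(NUM_ROWS)]
pe_rows = assign_rows_to_pes(NUM_ROWS, NUM_PES)

# Baseline: one magnitude ranking over the whole matrix.
global_pruned = [row[:] for row in dense]
all_coords = [(r, c) for r in range(NUM_ROWS) for c in range(NUM_COLS)]
prune_entries(global_pruned, all_coords, SPARSITY)
global_counts = nonzeros_per_pe(global_pruned, pe_rows)

# Load-balance-aware variant: rank and prune within each PE's rows separately,
# so every PE ends up with the same number of nonzero weights.
balanced_pruned = [row[:] for row in dense]
for rows in pe_rows:
    coords = [(r, c) for r in rows for c in range(NUM_COLS)]
    prune_entries(balanced_pruned, coords, SPARSITY)
balanced_counts = nonzeros_per_pe(balanced_pruned, pe_rows)

print("nonzeros per PE, global pruning:  ", global_counts)
print("nonzeros per PE, balanced pruning:", balanced_counts)
```

In a lock-step parallel design the slowest PE sets the pace, so the per-PE maximum of these counts, rather than the total, is the figure that matters for latency. Equalizing the counts is one plausible reading of why the pruned model is described as friendly to parallel processing.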

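Experimental Results

Research Questions

RQ1: Can load-balance-aware pruning and quantization shrink the LSTM model by 20x with negligible accuracy loss?
RQ2: How efficiently does the custom scheduler exploit parallelism in the compressed LSTM model on the FPGA?
RQ3: How much performance and energy-efficiency gain does deploying the compressed LSTM model on an FPGA deliver compared with a CPU and a GPU?
RQ4: Can the proposed hardware architecture sustain high throughput while keeping power consumption low?
RQ5: How much does model compression contribute to efficient inference on resource-constrained hardware platforms?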

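Key Findings

Running at 200 MHz on the XCKU060 FPGA, ESE achieves 282 GOPS on the compressed LSTM model.
ESE's effective throughput corresponds to 2.52 TOPS on the original uncompressed model.
ESE dissipates only 41 W while processing the full LSTM speech recognition task.
ESE runs inference 43x faster than the Core i7 5930k CPU.
ESE is 3x faster than the Pascal Titan X GPU.
ESE's energy efficiency is 40x higher than the CPU's and 11.5x higher than the GPU's.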
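
The reported figures can be cross-checked with some back-of-the-envelope arithmetic. The sketch below uses only the numbers quoted above; the variable names are hypothetical, and the last step assumes that "energy efficiency" means work per joule on the same workload, so that the efficiency gain equals the speedup multiplied by the ratio of baseline power to ESE power. The derived values are implied by rounded figures and are not measurements from the paper.

```python
# Figures quoted in the abstract and findings.
ESE_GOPS_SPARSE = 282.0        # throughput on the compressed model
ESE_TOPS_DENSE_EQUIV = 2.52    # equivalent throughput on the uncompressed model
ESE_POWER_W = 41.0
SPEEDUP = {"cpu": 43.0, "gpu": 3.0}
ENERGY_EFF_GAIN = {"cpu": 40.0, "gpu": 11.5}

# Total compression: 10x from pruning times 2x from quantization.
total_compression = 10 * 2

# Throughput per watt on the compressed model (about 6.88 GOPS/W).
gops_per_watt = ESE_GOPS_SPARSE / ESE_POWER_W

# Ratio of dense-equivalent to sparse throughput (about 8.94x),
# roughly in line with the 10x reduction attributed to pruning.
dense_to_sparse = ESE_TOPS_DENSE_EQUIV * 1000.0 / ESE_GOPS_SPARSE

# Assumption: efficiency_gain = speedup * (baseline_power / ese_power),
# so the implied baseline-to-ESE power ratio is efficiency_gain / speedup.
implied_power_ratio = {k: ENERGY_EFF_GAIN[k] / SPEEDUP[k] for k in SPEEDUP}

print(f"total compression:      {total_compression}x")
print(f"ESE efficiency:         {gops_per_watt:.2f} GOPS/W")
print(f"dense/sparse ratio:     {dense_to_sparse:.2f}x")
for name, ratio in implied_power_ratio.items():
    print(f"implied {name} power ratio: {ratio:.2f}x of ESE")
```

Under that assumption the CPU baseline would draw slightly less power than ESE (ratio about 0.93) and the GPU baseline nearly four times as much (ratio about 3.83), which is consistent with the GPU's energy-efficiency gap being much larger than its speed gap.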


This review was generated by AI and reviewed by a human editor.