QUICK REVIEW

[Paper Digest] ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Song Han, Junlong Kang|arXiv (Cornell University)|Dec 1, 2016

Speech Recognition and Synthesis | References: 18 | Cited by: 61

One-Sentence Summary

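This paper presents ESE, an FPGA-based speech recognition accelerator that reaches 282 GOPS on a compressed sparse LSTM model, using load-balance-aware pruning and quantization to shrink the model by 20x with negligible accuracy loss. Compared with a CPU, the system runs inference 43x faster and is 40x more energy efficient; optimized scheduling and hardware-aware sparse computation make real-time, low-power speech recognition possible on an FPGA.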

ABSTRACT

Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to high total cost of ownership (TCO) of a data center. In order to speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose scheduler that encodes and partitions the compressed model to each PE for parallelism, and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the compressed model. Implemented on Xilinx XCKU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

Motivation and Objectives

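Address the heavy compute and memory demands that large LSTM models place on speech recognition systems.
Reduce model size and memory bandwidth requirements without sacrificing prediction accuracy.
Design a hardware-software co-optimized accelerator for efficient inference on compressed sparse LSTM models on an FPGA.
Achieve high hardware utilization and high energy efficiency on real-time speech recognition workloads.
Process multiple speech streams concurrently with low latency and high throughput.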

Proposed Method

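Propose a load-balance-aware pruning method that compresses the LSTM model by 10x while preserving accuracy, followed by weight quantization that contributes a further 2x.
Develop an automatic dynamic-precision quantization flow to reduce model size and memory footprint.
Design a scheduler that maps the complex LSTM data flow onto multiple processing elements (PEs) and overlaps computation with memory access.
Implement a hardware architecture (ESE) that natively processes the weights and activations of a sparse LSTM and handles irregular sparsity patterns efficiently.
Store sparse weight matrices in compressed sparse column (CSC) format with relative indexing, for compact storage and efficient access.
Partition computation and storage across the PEs to balance load and maximize parallelism on the FPGA.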
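The relative-index CSC storage listed above can be illustrated with a minimal Python sketch. It is not the ESE implementation. The function names `encode_column` and `decode_column`, the 4-bit index width, and the rule of inserting a padding zero when a gap is too long for the index field are all assumptions made for this example. Each stored entry holds a value plus the number of zeros skipped since the previous entry, so the index needs only a few bits instead of a full row number.

```python
from typing import List, Tuple

Entry = Tuple[float, int]  # (value, zeros skipped since the previous entry)


def encode_column(column: List[float], index_bits: int = 4) -> List[Entry]:
    """Encode one dense column as (value, relative_index) pairs.

    index_bits is an assumed width for the relative index field.
    """
    max_gap = (1 << index_bits) - 1
    entries: List[Entry] = []
    gap = 0
    for x in column:
        if x == 0:
            if gap == max_gap:
                # The gap no longer fits in the index field, so store
                # this zero explicitly as a padding entry and restart.
                entries.append((0.0, max_gap))
                gap = 0
            else:
                gap += 1
        else:
            entries.append((x, gap))
            gap = 0
    return entries


def decode_column(entries: List[Entry], length: int) -> List[float]:
    """Rebuild the dense column from its relative-index encoding."""
    column = [0.0] * length
    pos = -1
    for value, gap in entries:
        pos += gap + 1  # skip `gap` zeros, then land on the stored entry
        column[pos] = value
    return column


column = [0.0, 0.0, 1.5, 0.0, -2.0]
entries = encode_column(column)  # [(1.5, 2), (-2.0, 1)]
restored = decode_column(entries, len(column))
```

The trade-off in this sketch is that a narrower index field saves bits per entry but forces more padding zeros when the matrix is very sparse.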

Experimental Results

Research Questions

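RQ1: How can an LSTM model be compressed effectively for deployment on resource-constrained hardware while preserving prediction accuracy?
RQ2: Which scheduling strategy allows recurrent neural network workloads with complex data dependencies to run efficiently on an FPGA?
RQ3: How should a hardware architecture be designed to exploit sparsity in recurrent networks such as LSTM?
RQ4: What performance and energy efficiency can an FPGA-based accelerator reach on sparse LSTM inference, compared with a CPU and a GPU?
RQ5: Can hardware-software co-design significantly improve inference speed and energy efficiency in real-time speech recognition?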

Key Findings

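Running at 200 MHz on a Xilinx XCKU060 FPGA, ESE achieves 282 GOPS on the sparse LSTM model, equivalent to 2.52 TOPS on the original dense model.
Model compression reduces the model size by 20x (10x from pruning and 2x from quantization) with negligible accuracy loss.
Compared with a Core i7-5930K CPU, ESE is 43x faster and 40x more energy efficient; compared with a Pascal Titan X GPU, it is 3x faster and 11.5x more energy efficient.
Load-balance-aware pruning improves hardware utilization: by reducing idle cycles in the processing elements, it raises throughput.
The scheduler overlaps computation with memory access, which significantly reduces latency in the recurrent data flow of the LSTM network.
ESE supports concurrent processing of multiple speech streams, demonstrating scalability and real-time performance on an FPGA.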
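The finding about load-balance-aware pruning and idle cycles can be made concrete with a small Python sketch. It illustrates the idea under stated assumptions and does not reproduce the paper's procedure. Rows are assumed to be interleaved across PEs (row `r` goes to PE `r % num_pes`), pruning is by plain magnitude threshold, and the names `prune_global`, `prune_load_balanced`, and `nonzeros_per_pe`, the 64x64 random matrix, and the choice of 4 PEs are all hypothetical. If PEs wait for each other, the PE holding the most nonzeros sets the pace, so equalizing the counts removes idle time.

```python
import random
from typing import List

Matrix = List[List[float]]


def _threshold(values: List[float], keep: int) -> float:
    """Magnitude of the keep-th largest entry."""
    return sorted((abs(v) for v in values), reverse=True)[keep - 1]


def prune_global(w: Matrix, keep_ratio: float) -> Matrix:
    """Keep the largest-magnitude weights across the whole matrix."""
    flat = [v for row in w for v in row]
    t = _threshold(flat, max(1, round(len(flat) * keep_ratio)))
    return [[v if abs(v) >= t else 0.0 for v in row] for row in w]


def prune_load_balanced(w: Matrix, keep_ratio: float, num_pes: int) -> Matrix:
    """Prune each PE's sub-matrix separately so all PEs keep the same count."""
    pruned = [list(row) for row in w]
    for pe in range(num_pes):
        rows = range(pe, len(w), num_pes)  # assumed row interleaving
        flat = [v for r in rows for v in w[r]]
        t = _threshold(flat, max(1, round(len(flat) * keep_ratio)))
        for r in rows:
            pruned[r] = [v if abs(v) >= t else 0.0 for v in w[r]]
    return pruned


def nonzeros_per_pe(w: Matrix, num_pes: int) -> List[int]:
    """Count the nonzero weights each PE would have to process."""
    return [
        sum(1 for r in range(pe, len(w), num_pes) for v in w[r] if v != 0.0)
        for pe in range(num_pes)
    ]


rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]

global_counts = nonzeros_per_pe(prune_global(weights, 0.1), 4)
balanced_counts = nonzeros_per_pe(prune_load_balanced(weights, 0.1, 4), 4)

# Under the synchronization assumption, the busiest PE bounds the runtime.
print("global pruning, nonzeros per PE:  ", global_counts)
print("balanced pruning, nonzeros per PE:", balanced_counts)
```

With balanced pruning every PE ends up with the same number of nonzeros, whereas global pruning leaves the split to chance.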


This digest was generated by AI and reviewed by human editors.