QUICK REVIEW

[Paper Review] ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA

Song Han, Junlong Kang | arXiv (Cornell University) | 2016-12-01

Speech Recognition and Synthesis | Citations: 38

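One-Line Summary

This paper proposes ESE, an efficient speech recognition engine that compresses LSTM models with load-balance-aware pruning and quantization, shrinking the model by 20x (10x from pruning, 2x from quantization) with negligible loss of accuracy. A scheduler and a custom hardware architecture implemented on a Xilinx XCKU060 FPGA reach 282 GOPS at 41 W. The paper reports this as 43x faster and 40x more energy efficient than a Core i7 5930k CPU, and 3x faster and 11.5x more energy efficient than a Pascal Titan X GPU.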

ABSTRACT

Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built larger and larger models. Such large model is both computation intensive and memory intensive. Deploying such bulky model results in high power consumption and leads to high total cost of ownership (TCO) of a data center. In order to speedup the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose scheduler that encodes and partitions the compressed model to each PE for parallelism, and schedule the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE) that works directly on the compressed model. Implemented on Xilinx XCKU060 FPGA running at 200MHz, ESE has a performance of 282 GOPS working directly on the compressed LSTM network, corresponding to 2.52 TOPS on the uncompressed one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.

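Research Motivation and Goals

Address the heavy computation and memory demands of large LSTM models in speech recognition.
Reduce model size and energy consumption, since deploying large models drives up power draw and the total cost of ownership (TCO) of a data center.
Compress the LSTM model so that inference is high-throughput and low-power while prediction accuracy is preserved.
Design a specialized hardware architecture that processes the compressed model efficiently.
Achieve better performance and energy efficiency than CPU and GPU implementations.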

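Proposed Method

A load-balance-aware pruning method reduces the model size by 10x while preserving accuracy.
Quantization compresses the model by a further 2x, for a total size reduction of 20x.
A scheduler encodes the compressed model and partitions it across processing elements (PEs) for parallel execution.
The same scheduler orders the complicated LSTM data flow, managing the sequential dependencies between operations.
ESE, a custom hardware architecture that works directly on the compressed LSTM model, is implemented on a Xilinx XCKU060 FPGA.
The scheduler and the hardware pipeline are integrated to enable efficient, real-time inference on the compressed model.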
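
The idea behind load-balance-aware pruning can be illustrated with a small sketch. It is not the authors' implementation: the function names, the tiny matrix, and the rule that row r belongs to PE r % num_pes are all assumptions made for illustration. The point it shows is that pruning each PE's share of the matrix to the same density leaves every PE with the same number of nonzero weights, so no PE sits idle waiting for a more heavily loaded one.

```python
import random


def load_balanced_prune(weights, num_pes, keep_ratio):
    """Magnitude-prune a dense matrix so each PE keeps the same number of weights.

    weights    : list of rows, each a list of floats
    num_pes    : number of processing elements (hypothetical parameter)
    keep_ratio : fraction of weights each PE keeps, e.g. 0.1 for 10x pruning

    Assumption for illustration: row r is owned by PE (r % num_pes).
    """
    pruned = [[0.0] * len(row) for row in weights]
    for pe in range(num_pes):
        # Gather (magnitude, row, col) for every weight owned by this PE.
        owned = [
            (abs(w), r, c)
            for r in range(pe, len(weights), num_pes)
            for c, w in enumerate(weights[r])
        ]
        keep = int(len(owned) * keep_ratio)
        # Keep only the largest-magnitude weights within this PE's submatrix.
        owned.sort(reverse=True)
        for _, r, c in owned[:keep]:
            pruned[r][c] = weights[r][c]
    return pruned


def nonzeros_per_pe(matrix, num_pes):
    """Count the nonzero weights assigned to each PE."""
    counts = [0] * num_pes
    for r, row in enumerate(matrix):
        counts[r % num_pes] += sum(1 for w in row if w != 0.0)
    return counts


random.seed(0)
dense = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
sparse = load_balanced_prune(dense, num_pes=4, keep_ratio=0.1)
print(nonzeros_per_pe(sparse, 4))  # every PE ends up with the same count
```

With a single global magnitude threshold, by contrast, the per-PE counts would generally differ, and the slowest PE would set the pace for the whole matrix-vector product.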

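Experimental Results

Research Questions

RQ1: Can load-balance-aware pruning and quantization reduce LSTM model size by 20x with almost no loss of accuracy?
RQ2: How effectively can the custom scheduler exploit parallelism in the compressed LSTM model on the FPGA?
RQ3: How much do performance and energy efficiency improve when the compressed LSTM model runs on an FPGA rather than on a CPU or GPU?
RQ4: Can the proposed hardware architecture sustain high throughput while keeping power consumption low?
RQ5: How much does model compression improve inference efficiency on resource-constrained hardware platforms?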

Key Results

Running at 200 MHz on the XCKU060 FPGA, ESE delivers 282 GOPS on the compressed LSTM model.
This corresponds to 2.52 TOPS on the original, uncompressed model.
ESE processes a full LSTM for speech recognition with a power dissipation of 41 W.
ESE is 43x faster than the Core i7 5930k CPU implementation.
ESE is 3x faster than the Pascal Titan X GPU implementation.
ESE is 40x more energy efficient than the CPU and 11.5x more energy efficient than the GPU.
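
As a quick sanity check, the reported figures can be related to one another with simple arithmetic. The snippet below uses only numbers stated in the abstract; the variable names are illustrative, and reading the dense-equivalent ratio as a reflection of the roughly 10x pruning is an inference, not a claim made by the paper.

```python
# Figures as reported in the abstract.
compressed_gops = 282.0         # throughput on the compressed model
dense_equivalent_gops = 2520.0  # 2.52 TOPS on the uncompressed model
power_watts = 41.0

# How many dense operations each executed sparse operation stands in for.
dense_equivalent_ratio = dense_equivalent_gops / compressed_gops

# Throughput per watt on the compressed model.
gops_per_watt = compressed_gops / power_watts

# Overall compression: 10x from pruning times 2x from quantization.
total_compression = 10 * 2

print(f"dense-equivalent ratio: {dense_equivalent_ratio:.2f}x")
print(f"efficiency: {gops_per_watt:.2f} GOPS/W")
print(f"total compression: {total_compression}x")
```

The ratio comes out just under 9x, which is in the same range as the 10x reduction attributed to pruning.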

This review was generated by AI and checked by a human editor.