QUICK REVIEW

[Paper Review] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao | arXiv (Cornell University) | 2015-10-01

Advanced Neural Network Applications | References: 22 | Citations: 3,526

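One-Line Summary

The paper introduces a three-stage pipeline (pruning, trained quantization with weight sharing, and Huffman coding) that compresses deep networks without loss of accuracy, so that models fit in on-chip storage and run with better energy efficiency.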

ABSTRACT

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.

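Research Motivation and Goals

Reduce the storage and memory-bandwidth requirements of deep neural networks for mobile and embedded deployment.
Compress model parameters substantially while preserving the original accuracy.
Shrink models enough to fit in on-chip SRAM cache rather than off-chip DRAM.
Demonstrate compression gains across several architectures (LeNet, AlexNet, VGG-16) on MNIST and ImageNet.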

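Proposed Method

Prune the network to remove low-importance connections, then retrain the remaining weights.
Apply trained quantization: cluster the weights so that they share values, and store only a small codebook plus per-weight indices.
Retrain after quantization to fine-tune the shared weights (the centroids).
Apply Huffman coding, which exploits the non-uniform distribution of weights and indices for further compression.
Evaluate compression on the MNIST and ImageNet benchmarks, reporting storage savings and accuracy.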
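
The first two stages can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`prune_by_magnitude`, `kmeans_1d`, `share_weights`), the threshold, the cluster count, and the random weights are all assumptions made for illustration, and the retraining steps described above are omitted entirely. The centroids are initialized at evenly spaced points between the smallest and largest weight, which is one possible choice for this sketch.

```python
import random


def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def kmeans_1d(values, k, iterations=20):
    """Cluster scalars into k centroids; returns (centroids, assignments)."""
    if k < 2 or not values:
        raise ValueError("need k >= 2 and at least one value")
    lo, hi = min(values), max(values)
    # Evenly spaced initial centroids between the min and max value.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iterations):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


def share_weights(pruned, k):
    """Replace each surviving weight by its cluster centroid (the codebook)."""
    nonzero = [w for w in pruned if w != 0.0]
    codebook, indices = kmeans_1d(nonzero, k)
    it = iter(indices)
    quantized = [codebook[next(it)] if w != 0.0 else 0.0 for w in pruned]
    return codebook, quantized


# Hypothetical layer: 200 random weights, pruned and then shared over 4 values.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(200)]
pruned = prune_by_magnitude(weights, threshold=0.8)
codebook, quantized = share_weights(pruned, k=4)
kept = sum(1 for w in pruned if w != 0.0)
print(f"kept {kept} of {len(weights)} weights")
print("codebook:", [round(c, 3) for c in codebook])
```

After this step only the codebook and one small index per surviving weight need to be stored, which is where the reduction in bits per connection comes from.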

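Experimental Results

Research Questions

RQ1: Can pruning remove redundant connections in large CNNs without loss of accuracy?
RQ2: How much can weight sharing via trained quantization reduce storage while maintaining performance?
RQ3: Does Huffman coding provide additional compression beyond pruning and quantization, and if so, how much?
RQ4: What are the practical storage, speed, and energy effects of Deep Compression on real hardware?
RQ5: How do these techniques interact across architectures (LeNet, AlexNet, VGG-16) and datasets (MNIST, ImageNet)?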

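Key Results

Model storage is reduced by 35× to 49× without loss of accuracy.
AlexNet shrinks from 240MB to 6.9MB (35×); VGG-16 shrinks from 552MB to 11.3MB (49×).
Pruning alone reduces the number of parameters by 9× to 13×; quantization lowers the bits per connection from 32 to 5; Huffman coding adds a further 20%–30% of compression.
Pruning and quantization are complementary: used together they reach about 3% of the original size without loss of accuracy.
Compression makes on-chip SRAM storage possible, which lowers energy use and enables mobile deployment; non-batched inference shows a 3×–4× layerwise speedup and 3×–7× better energy efficiency.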
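
The reported ratios and the effect of Huffman coding can be checked with a short sketch. The model sizes below are the ones reported above; everything else is an assumption for illustration: `huffman_code_lengths` is a hypothetical helper, and the stream of 5-bit cluster indices is made up, so the saving it prints is not expected to match the paper's 20%–30% figure.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over symbols."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries are (count, tiebreak, {symbol: depth}); the unique
    # tiebreak ensures the dictionaries are never compared.
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Made-up, skewed stream of 5-bit cluster indices (32 possible values).
indices = [i for i in range(32) for _ in range(1 + (31 - i) ** 2)]
counts = Counter(indices)
lengths = huffman_code_lengths(indices)

fixed_bits = 5 * len(indices)  # every index stored with 5 bits
huffman_bits = sum(lengths[s] * n for s, n in counts.items())
saving = 1 - huffman_bits / fixed_bits
print(f"Huffman saving on this toy stream: {saving:.1%}")

# Compression ratios implied by the reported sizes (in MB).
alexnet_ratio = 240 / 6.9
vgg16_ratio = 552 / 11.3
print(f"AlexNet: {alexnet_ratio:.1f}x, VGG-16: {vgg16_ratio:.1f}x")
```

Frequent indices receive short codes and rare ones long codes, so the more skewed the index distribution, the larger the saving over fixed 5-bit storage.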

This review was generated by AI and checked by a human editor.