QUICK REVIEW

[Paper Review] To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu, Suyog Gupta | arXiv (Cornell University) | 2017-10-05

Advanced Neural Network Applications | Citations: 661

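One-line summary

This paper compares large-sparse (pruned) models with small-dense models on vision and natural language processing tasks, finds that large-sparse models tend to outperform dense counterparts with the same memory footprint, and introduces a simple gradual pruning method.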

ABSTRACT

Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model's dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.

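Motivation and objectives

Motivate model compression for energy-efficient on-device inference.
Evaluate two compression paths: large-sparse (a large model that has been pruned) versus small-dense (a smaller model that stays dense).
Develop a simple gradual pruning technique that is easy to apply during training.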

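Proposed method

Pruning is implemented by extending TensorFlow to add a binary mask to each pruned layer; the mask zeroes out small-magnitude weights in the forward pass.
A gradual sparsity schedule s_t raises sparsity from an initial value s_i to a final value s_f over n pruning steps, following a cubic schedule: s_t = s_f + (s_i - s_f)(1 - (t - t_0)/(nΔt))^3, where t_0 is the training step at which pruning starts.
The masks are updated every Δt training steps, which lets the network recover from the loss in accuracy that each pruning step causes.
Pruning is applied across a range of architectures (InceptionV3, MobileNets, stacked LSTMs, seq2seq LSTMs, NMT).
For each task, large-sparse and small-dense models are compared at the same memory footprint.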
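
The cubic schedule and the magnitude-based mask described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' TensorFlow implementation: the function names `sparsity_at_step` and `magnitude_mask`, the clamping of the schedule outside the pruning window, and all hyperparameter values (s_i = 0, s_f = 0.9, n = 10, Δt = 100) are assumptions chosen for the example. The weights are also held fixed here, whereas in real training they would keep changing between mask updates.

```python
import numpy as np

def sparsity_at_step(t, s_i, s_f, t_0, n, delta_t):
    """Target sparsity s_t = s_f + (s_i - s_f) * (1 - (t - t_0) / (n * delta_t))**3."""
    # Assumption: clamp progress to [0, 1] so the schedule stays at s_i
    # before t_0 and at s_f after t_0 + n * delta_t.
    progress = np.clip((t - t_0) / (n * delta_t), 0.0, 1.0)
    return float(s_f + (s_i - s_f) * (1.0 - progress) ** 3)

def magnitude_mask(weights, sparsity):
    """Binary mask that zeroes the `sparsity` fraction of smallest-magnitude weights."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to prune
    mask = np.ones(flat.size, dtype=weights.dtype)
    if k > 0:
        # Indices of the k smallest absolute values (order among them is irrelevant).
        prune_idx = np.argpartition(flat, k - 1)[:k]
        mask[prune_idx] = 0
    return mask.reshape(weights.shape)

# Hypothetical hyperparameters, chosen only for illustration.
s_i, s_f, t_0, n, delta_t = 0.0, 0.9, 0, 10, 100

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))  # stand-in for one layer's weight matrix

schedule = []
for t in range(t_0, t_0 + n * delta_t + 1, delta_t):
    s_t = sparsity_at_step(t, s_i, s_f, t_0, n, delta_t)
    mask = magnitude_mask(weights, s_t)  # mask refreshed every delta_t steps
    masked_weights = weights * mask      # what the forward pass would use
    schedule.append(s_t)

final_sparsity = 1.0 - mask.mean()
print([round(s, 3) for s in schedule])
print(f"final sparsity: {final_sparsity:.4f}")
```

Because the schedule is cubic in the remaining fraction of pruning steps, most of the pruning happens early and the rate tapers off as s_t approaches s_f.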

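Experimental results

Research questions

RQ1: Can pruning a large model to high sparsity outperform training a smaller dense model with the same memory footprint?
RQ2: How does gradual pruning affect accuracy across vision and NLP architectures?
RQ3: What are the practical hardware and storage requirements of using sparse versus dense models for on-device inference?
RQ4: For a given parameter budget, is there an optimal sparsity level that maximizes accuracy?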

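Key results

Across tasks, large-sparse models consistently outperform small-dense models with a similar memory footprint.
On InceptionV3, 50% sparsity gives 13.6M non-zero parameters (NNZ) with 78.0% top-1 and 94.2% top-5 accuracy, compared with 27.1M NNZ, 78.1% top-1, and 94.3% top-5 at 0% sparsity.
At 87.5% sparsity, InceptionV3 shrinks to 3.3M NNZ and drops to 74.6% top-1 and 92.5% top-5, a relatively small accuracy loss given the degree of compression.
MobileNets reach 67.7% top-1 at 75% sparsity (1.09M NNZ), outperforming the dense 0.5-width MobileNet at a comparable NNZ budget, and the 90–95% sparse models retain higher accuracy than dense networks of equivalent size.
On Penn Tree Bank, the 90% sparse large model (6.6M NNZ) reaches a perplexity of 80.24, better than the 83.37 of the dense medium model (19.8M NNZ); at 85% sparsity (about 3.0M NNZ), perplexity falls in the 85.17–85.87 range, which suggests a favorable compression range.
On Google Neural Machine Translation, the 90% sparse model (23M NNZ) achieves BLEU scores close to those of the large dense baseline, and 80% sparsity sometimes improves BLEU slightly; the 90% sparse 1024-unit model (23M NNZ) is comparable to the dense 512-unit model (81M parameters).
Overall, large-sparse models offer a favorable trade-off, suggesting that training a larger model and then pruning it tends to give better accuracy at the same size.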
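
To put the reported parameter counts in perspective, the short sketch below computes reduction factors from the NNZ figures quoted above. The helper name `compression_ratio` is hypothetical, and the ratios count non-zero parameters only; they ignore any extra storage a sparse format needs for index information, so they should be read as an upper bound on memory savings rather than a measured footprint.

```python
def compression_ratio(dense_nnz_millions, sparse_nnz_millions):
    """Reduction factor in non-zero parameters (index overhead not included)."""
    return dense_nnz_millions / sparse_nnz_millions

# NNZ figures (in millions) as quoted in the results above.
inception_dense = 27.1
inception_sparse = {"50%": 13.6, "87.5%": 3.3}

ratios = {}
for sparsity, nnz in inception_sparse.items():
    ratios[sparsity] = compression_ratio(inception_dense, nnz)
    print(f"InceptionV3 at {sparsity} sparsity: {ratios[sparsity]:.2f}x fewer non-zero parameters")

# Penn Tree Bank: the dense medium model versus the 90% sparse large model.
ptb_ratio = compression_ratio(19.8, 6.6)
print(f"PTB: the sparse large model uses {ptb_ratio:.1f}x fewer non-zero parameters than the dense medium model")
```

On these figures, the 87.5% sparse InceptionV3 has roughly 8x fewer non-zero parameters than the dense baseline for a 3.5-point drop in top-1 accuracy, and the sparse large PTB model beats the dense medium model on perplexity with about a third of its non-zero parameters.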

This review was generated by AI and checked by a human editor.