QUICK REVIEW

[Paper Review] Rethinking the Value of Network Pruning

Zhuang Liu, Mingjie Sun | arXiv (Cornell University) | 2018-10-11

Anomaly Detection Techniques and Applications | Citations: 742

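One-line summary

For structured pruning, training the pruned model from scratch often matches or exceeds fine-tuning with inherited weights. The pruned architecture itself, rather than the inherited weights, is the main driver of efficiency, which suggests that pruning can act as a form of architecture search.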

ABSTRACT

Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.

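Motivation and objectives

Questions whether a large, over-parameterized model needs to be trained before pruning.
Evaluates whether fine-tuning a pruned model with inherited weights outperforms training the same pruned model from scratch.
Distinguishes between predefined target architectures and automatically discovered target architectures.
Assesses whether pruning functions mainly as architecture search rather than as weight selection.
Compares structured and unstructured pruning and discusses the relationship to the Lottery Ticket Hypothesis.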

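Proposed method

Classifies pruning methods into those with a predefined target architecture and those with an automatically discovered target architecture.
Compares training the pruned model from scratch (Scratch) against fine-tuning with weights inherited from the large model (Fine-tune). Scratch-E trains for the same number of epochs as the large model, and Scratch-B trains with the same computation budget.
Applies a range of structured pruning methods (L1-norm based filter pruning, ThiNet, regression-based feature reconstruction, Network Slimming, Sparse Structure Selection) as well as unstructured magnitude-based pruning.
Evaluates on CIFAR-10, CIFAR-100, and ImageNet with VGG, ResNet, and DenseNet variants.
Analyzes the parameter efficiency and sparsity patterns of the pruned architectures.
Compares with the Lottery Ticket Hypothesis and discusses implications for architecture search.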
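
The distinction between predefined and automatically discovered target architectures can be made concrete with a toy sketch. The code below is an illustration only, not the authors' implementation: `prune_predefined` removes the same fraction of filters from every layer (in the spirit of L1-norm based filter pruning), while `prune_automatic` applies one global threshold across all layers so that the per-layer widths emerge from the scores (in the spirit of Network Slimming, with the filter L1 norm standing in for that method's learned channel scaling factors, which is an assumption made here for brevity). The function names, the nested-list layer representation, and the toy weights are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: a layer is a list of filters,
# and each filter is a flat list of weights.
Layer = List[List[float]]


def l1_norms(layer: Layer) -> List[float]:
    """Return the L1 norm (sum of absolute weights) of every filter."""
    return [sum(abs(w) for w in filt) for filt in layer]


def prune_predefined(layers: Dict[str, Layer], ratio: float) -> Dict[str, List[int]]:
    """Predefined target: drop the same fraction of filters in every layer.

    The per-layer widths are fixed by `ratio` before any weight is inspected;
    the norms only decide WHICH filters survive, not HOW MANY.
    """
    kept = {}
    for name, layer in layers.items():
        n_keep = max(1, round(len(layer) * (1.0 - ratio)))
        norms = l1_norms(layer)
        ranked = sorted(range(len(layer)), key=lambda i: norms[i], reverse=True)
        kept[name] = sorted(ranked[:n_keep])
    return kept


def prune_automatic(layers: Dict[str, Layer], ratio: float) -> Dict[str, List[int]]:
    """Automatic target: one global threshold shared by all layers.

    The per-layer widths are an OUTPUT of pruning, so the resulting
    architecture is discovered rather than specified in advance.
    """
    all_scores = sorted(s for layer in layers.values() for s in l1_norms(layer))
    n_drop = int(len(all_scores) * ratio)
    threshold = all_scores[n_drop] if n_drop < len(all_scores) else float("inf")
    kept = {}
    for name, layer in layers.items():
        norms = l1_norms(layer)
        kept[name] = [i for i, s in enumerate(norms) if s >= threshold]
    return kept


# Toy weights (hypothetical): conv1 has large filters, conv2 mostly small ones.
demo = {
    "conv1": [[0.9, -0.8], [0.7, 0.6], [0.5, -0.5], [0.4, 0.4]],
    "conv2": [[0.3, -0.2], [0.05, 0.05], [0.02, -0.01], [0.6, 0.5]],
}
uniform_widths = {k: len(v) for k, v in prune_predefined(demo, 0.5).items()}
discovered_widths = {k: len(v) for k, v in prune_automatic(demo, 0.5).items()}
```

With these toy weights, the predefined rule keeps two filters in each layer, whereas the global rule keeps three filters in conv1 and one in conv2. Under the paper's argument, it is this resulting width pattern, rather than the surviving weight values, that would be retrained from random initialization.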

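Experimental results

Research questions

RQ1: Does fine-tuning with inherited weights outperform training the same pruned architecture from scratch, for both predefined and automatically discovered target architectures?
RQ2: To what extent are the final efficiency and accuracy determined by the pruned architecture rather than by the preserved weights?
RQ3: Can pruning serve as an effective architecture search method that yields parameter-efficient architectures without large-scale pre-training?
RQ4: How do structured and unstructured pruning compare in how well a pruned model can be trained from scratch on large-scale datasets such as ImageNet?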

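Key results

For predefined structured pruning, models trained from scratch match or exceed the accuracy of their fine-tuned counterparts. Scratch-B is often better than Scratch-E and sometimes outperforms fine-tuning on ImageNet as well.
For automatic structured pruning, pruned models trained from scratch generally match or beat fine-tuned models, with Scratch-B frequently the strongest.
For unstructured pruning on ImageNet, training from scratch tends to underperform fine-tuning, which highlights a difference from structured pruning.
Architectures obtained by automatic pruning methods are more parameter-efficient than uniformly pruned architectures, which suggests the value of pruning as architecture search.
Design patterns derived from pruned architectures ("guided" architectures) can transfer to other models and datasets, which suggests practical design principles beyond any specific pruned model.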
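
The Scratch-E and Scratch-B baselines mentioned above differ only in training budget. A minimal sketch, assuming the budget is expressed in epochs and that Scratch-B scales the epoch count in proportion to the FLOPs saved by pruning; the paper's exact schedule may differ, and `scratch_epochs` and its parameters are hypothetical names:

```python
def scratch_epochs(base_epochs: int, flops_large: float,
                   flops_pruned: float, mode: str) -> int:
    """Epoch budget for training a pruned model from scratch.

    mode "E": same number of epochs as the large model.
    mode "B": same total compute as the large model, so the epoch count
              grows with the FLOPs ratio (proportional rule is an assumption).
    """
    if flops_large <= 0 or flops_pruned <= 0:
        raise ValueError("FLOPs must be positive")
    if mode == "E":
        return base_epochs
    if mode == "B":
        # A cheaper model can afford more epochs within the same compute budget.
        return int(base_epochs * flops_large / flops_pruned)
    raise ValueError(f"unknown mode: {mode!r}")


# Toy numbers (hypothetical): the pruned model costs half the FLOPs per pass.
epochs_e = scratch_epochs(100, flops_large=4.0, flops_pruned=2.0, mode="E")
epochs_b = scratch_epochs(100, flops_large=4.0, flops_pruned=2.0, mode="B")
```

Here `epochs_e` stays at 100 while `epochs_b` becomes 200, which reflects why Scratch-B is the fairer comparison when the pruned model is much cheaper per epoch than the large model.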

This review was generated by AI and reviewed by a human editor.