QUICK REVIEW

[Paper Review] Movement Pruning: Adaptive Sparsity by Fine-Tuning

Victor Sanh, Thomas Wolf | arXiv (Cornell University) | May 15, 2020

Topic Modeling | References: 49 | Citations: 33

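One-Line Summary

This paper proposes movement pruning, a first-order weight pruning method adapted to fine-tuning that selects which weights to prune from their accumulated movement during training. It performs strongly in high-sparsity regimes on pretrained language models such as BERT, especially when combined with distillation.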

ABSTRACT

Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

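Motivation and Objectives

Motivates the need for pruning in transfer learning, where pretrained weights are fine-tuned on task data.
Introduces movement pruning as a first-order, adaptive pruning method.
Provides mathematical foundations and compares the method with zeroth- and first-order pruning methods.
Demonstrates strong performance on NLP tasks in high-sparsity regimes and with distillation.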

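Proposed Method

Defines importance scores S and a mask M that are used to prune weights during fine-tuning.
Hard movement pruning keeps the top-scoring weights and learns the scores behind the mask with a straight-through estimator, which gives ∂L/∂S_{i,j} = ∂L/∂a_i · W_{i,j} · x_j.
Provides a soft movement pruning variant that uses a fixed score threshold and a sparsity regularization term, so that scores decrease over time.
Relates movement pruning to L0 regularization and explains how gradients flow in that framework.
Runs experiments on SQuAD, MNLI, and QQP with BERT-base-uncased, using a cubic sparsity schedule and task-specific fine-tuning.
Adds distillation to improve performance under pruning.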
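
The masking and score-gradient steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`topv_mask`, `soft_mask`, `masked_forward`, `score_grad`, `cubic_keep_fraction`) and the toy numbers are hypothetical, the mask is computed over a single small weight matrix, and the exact shape of the cubic schedule (its `warmup` and `cooldown` parameters) is assumed rather than taken from the review.

```python
def topv_mask(scores, keep_fraction):
    """Hard variant: keep the `keep_fraction` highest-scoring weights.

    Ties at the cut-off score may keep a few extra weights.
    """
    flat = sorted((s for row in scores for s in row), reverse=True)
    n_keep = max(1, int(round(keep_fraction * len(flat))))
    threshold = flat[n_keep - 1]
    return [[1.0 if s >= threshold else 0.0 for s in row] for row in scores]


def soft_mask(scores, tau):
    """Soft variant: keep every weight whose score exceeds the fixed threshold `tau`."""
    return [[1.0 if s > tau else 0.0 for s in row] for row in scores]


def masked_forward(weights, mask, x):
    """Compute a_i = sum_j W_ij * M_ij * x_j."""
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def score_grad(grad_a, weights, x):
    """Straight-through estimator: dL/dS_ij = dL/da_i * W_ij * x_j.

    The mask is treated as the identity in the backward pass, so pruned
    weights still receive a score gradient and can re-enter the mask later.
    """
    return [[g * w * xj for w, xj in zip(w_row, x)] for g, w_row in zip(grad_a, weights)]


def cubic_keep_fraction(step, total_steps, final_keep,
                        warmup=0, cooldown=0, initial_keep=1.0):
    """Assumed cubic schedule: hold, decay cubically to `final_keep`, then hold."""
    if step < warmup:
        return initial_keep
    if step >= total_steps - cooldown:
        return final_keep
    progress = (step - warmup) / (total_steps - warmup - cooldown)
    return final_keep + (initial_keep - final_keep) * (1.0 - progress) ** 3


# Toy example (all values are made up for illustration).
scores = [[0.9, -0.2], [0.1, 0.5]]
weights = [[1.0, 2.0], [-3.0, 4.0]]
x = [0.5, -1.0]
grad_a = [1.0, -2.0]  # dL/da, assumed to come from backpropagation

mask = topv_mask(scores, keep_fraction=0.5)      # [[1.0, 0.0], [0.0, 1.0]]
activations = masked_forward(weights, mask, x)   # [0.5, -4.0]
grads = score_grad(grad_a, weights, x)           # [[0.5, -2.0], [3.0, 8.0]]

# One plain gradient step on the scores (learning rate is arbitrary).
lr = 0.1
new_scores = [[s - lr * g for s, g in zip(s_row, g_row)]
              for s_row, g_row in zip(scores, grads)]
```

In the toy example the two masked-out weights still get non-zero score gradients, which is the point of the straight-through estimator: a weight pruned early can recover if its score rises. The score of a weight goes up when the loss gradient and the weight have opposite signs, that is, when fine-tuning is moving the weight away from zero.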

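Experimental Results

Research Questions

RQ1: Can movement pruning outperform magnitude pruning in transfer learning for NLP?
RQ2: How does first-order movement information gathered during fine-tuning affect sparsity patterns and performance at high sparsity levels?
RQ3: Does soft movement pruning with distillation offer a better trade-off between model size and accuracy?
RQ4: What characterizes pruned models under local versus global masking strategies?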

Key Results

Method	SQuAD Dev EM/F1	Remaining weights (%)	MNLI Dev acc/MM acc	QQP Dev acc/F1
MaP	40.1/54.5	3%	68.9/69.8	72.1/58.4
L0 Regu	61.2/73.3	3%	75.1/75.4	86.5/81.0
MvP	65.2/76.3	3%	76.1/76.7	85.6/81.0
soft MvP	69.5/79.9	3%	79.0/79.6	89.3/85.6

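Movement pruning clearly outperforms magnitude pruning in high-sparsity regimes (<15% of weights remaining).
Soft movement pruning is the strongest of the pruning methods in high-sparsity settings, and even more so when combined with distillation.
On SQuAD with 3% of weights remaining, movement pruning reaches 65.2/76.3 (EM/F1) and soft movement pruning reaches 69.5/79.9.
On MNLI with 3% of weights remaining, soft movement pruning reaches 79.0/79.6 (acc/MM acc).
On QQP with 3% of weights remaining, soft movement pruning reaches 89.3/85.6 (acc/F1).
Distillation improves performance for every pruning method and keeps it strong at high sparsity (e.g., SQuAD at 3% reaches 76.6/84.9 with distillation).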
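
To make the size of these gaps concrete, the short script below computes how far each movement pruning variant sits above magnitude pruning (MaP) at 3% remaining weights. It is only arithmetic on the numbers in the table above; the dictionary layout and the helper name `gain_over` are illustrative choices.

```python
# Scores at 3% remaining weights, copied from the results table:
# (SQuAD F1, MNLI acc, QQP F1)
results = {
    "MaP": (54.5, 68.9, 58.4),
    "L0 Regu": (73.3, 75.1, 81.0),
    "MvP": (76.3, 76.1, 81.0),
    "soft MvP": (79.9, 79.0, 85.6),
}
tasks = ("SQuAD F1", "MNLI acc", "QQP F1")


def gain_over(baseline, method):
    """Per-task difference in points between `method` and `baseline`."""
    return {
        task: round(m - b, 1)
        for task, b, m in zip(tasks, results[baseline], results[method])
    }


for method in ("MvP", "soft MvP"):
    print(method, gain_over("MaP", method))
```

On these figures, soft movement pruning is ahead of magnitude pruning by 25.4 points of SQuAD F1, 10.1 points of MNLI accuracy, and 27.2 points of QQP F1.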

This review was generated by AI and reviewed by a human editor.