[Paper Review] A Fast Post-Training Pruning Framework for Transformers
A Transformer pruning framework that uses Fisher-based mask search, mask rearrangement, and mask tuning to prune attention heads and FFN filters, substantially reducing FLOPs and latency while preserving accuracy.
Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-base and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain the models.
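Motivation and Objectives
- Motivate practical, retraining-free compression of Transformer models for deployment under FLOPs/latency constraints.
- Develop a three-stage pruning pipeline (mask search, rearrangement, tuning), guided by Fisher information, that selects which heads and FFN filters to prune.
- Avoid full retraining while preserving accuracy within a limited budget, enabling fast pruning from a small data sample.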
Proposed Method
- Represent pruning as a constrained mask optimization over MHA heads and FFN filters with binary masks.
- Use a Fisher information-based mask search to select which heads/filters to prune under a FLOPs/latency budget.
- Apply a mask rearrangement stage to capture intra-layer interactions via a block-diagonal Fisher approximation.
- Perform a mask tuning stage that reconstructs layer activations by solving a layer-wise linear least squares problem.
- Extend the approach to latency constraints by approximating latency with a piece-wise linear model and adapting the search accordingly.
- Demonstrate pruning on BERT_BASE and DistilBERT, evaluating on GLUE and SQuAD with minimal accuracy loss.
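The Fisher-based mask search listed above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it scores each head or filter by the diagonal of the empirical Fisher information (the mean squared gradient of the loss with respect to its mask entry), assumes every head costs the same FLOPs and every filter costs the same FLOPs, and ignores the intra-layer interactions that the rearrangement stage is meant to capture. The names `diag_fisher`, `search_mask`, `head_cost`, and `filter_cost` are hypothetical.

```python
import numpy as np


def diag_fisher(mask_grads):
    """Diagonal empirical Fisher: mean squared gradient per mask entry.

    mask_grads: array of shape (num_samples, num_mask_entries) holding the
    gradient of the loss w.r.t. each mask entry for each sample.
    """
    return np.mean(np.square(mask_grads), axis=0)


def search_mask(head_imp, filter_imp, head_cost, filter_cost, budget):
    """Choose which heads/filters to keep under a FLOPs budget.

    Assumption of this sketch: uniform cost per head and per filter.
    Tries every possible number of kept heads, spends the remaining budget
    on filters, and keeps the split that retains the most importance
    (equivalently, prunes the least).
    """
    head_imp = np.asarray(head_imp, dtype=float)
    filter_imp = np.asarray(filter_imp, dtype=float)
    head_order = np.argsort(-head_imp)      # most important first
    filter_order = np.argsort(-filter_imp)
    # kept_*[k] = total importance retained when keeping the top-k entries
    kept_heads = np.concatenate(([0.0], np.cumsum(head_imp[head_order])))
    kept_filters = np.concatenate(([0.0], np.cumsum(filter_imp[filter_order])))

    best_score, best_heads, best_filters = -np.inf, 0, 0
    for n_heads in range(len(head_imp) + 1):
        remaining = budget - n_heads * head_cost
        if remaining < 0:
            break  # the kept heads alone would already exceed the budget
        n_filters = min(len(filter_imp), int(remaining // filter_cost))
        score = kept_heads[n_heads] + kept_filters[n_filters]
        if score > best_score:
            best_score, best_heads, best_filters = score, n_heads, n_filters

    head_mask = np.zeros(len(head_imp), dtype=int)
    filter_mask = np.zeros(len(filter_imp), dtype=int)
    head_mask[head_order[:best_heads]] = 1
    filter_mask[filter_order[:best_filters]] = 1
    return head_mask, filter_mask


# Toy example (made-up numbers): 4 heads costing 4 units each,
# 6 filters costing 1 unit each, and a budget of 50% of the total 22 units.
head_imp = [0.9, 0.1, 0.5, 0.05]
filter_imp = [0.3, 0.02, 0.2, 0.01, 0.4, 0.03]
head_mask, filter_mask = search_mask(head_imp, filter_imp, 4.0, 1.0, budget=11.0)
print(head_mask, filter_mask)
```

In this toy case the search keeps the two most important heads and the three most important filters, which uses the budget exactly. In practice the importance scores would come from gradients collected on the sample dataset, for example via `diag_fisher`.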
Experimental Results
Research Questions
- RQ1: Can Transformer models be pruned to meet FLOPs/latency constraints without retraining?
- RQ2: How can we identify which heads and FFN filters to prune so as to minimize accuracy loss under a given resource constraint?
- RQ3: Does a post-training pruning pipeline with mask search, rearrangement, and tuning outperform retraining-based pruning methods in efficiency and accuracy trade-offs?
- RQ4: What is the practical speedup achievable on real hardware when using retraining-free pruning on common benchmarks?
Key Results
- The framework achieves up to 2.0× reduction in FLOPs and up to 1.56× speedup in inference latency with less than 1% accuracy loss.
- Pruning can be completed in under 3 minutes on a single GPU, which is over two orders of magnitude faster than retraining-based pruning methods.
- On GLUE and SQuAD, pruning within a 1% accuracy loss brings BERT_BASE down to 60–70% of its original FLOPs on several tasks and DistilBERT down to roughly 50%.
- Latency experiments on NVIDIA V100 show average speedups around 1.47× to 1.56× at batch size 256 under a 1% accuracy constraint.
- The proposed Fisher-based mask search, rearrangement, and tuning stages each contribute to recovering or preserving accuracy, with mask tuning playing a critical role in accuracy restoration.
- Compared to prior structured pruning methods, the retraining-free approach achieves comparable or better FLOPs-accuracy trade-offs at substantially lower pruning cost (end-to-end pruning in under 3 minutes on a single GPU).
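To illustrate why mask tuning can restore accuracy, the layer-wise least-squares step can be sketched as below. This is a hedged sketch, not the authors' code: it assumes a layer's output is a residual input plus a mask-weighted sum of per-head (or per-filter) contributions, and it solves an unconstrained least-squares problem for the surviving mask entries with `numpy.linalg.lstsq`. The function `tune_mask` and all array names are hypothetical, and the data is random.

```python
import numpy as np


def tune_mask(unit_outputs, residual, target, keep):
    """Rescale the kept mask entries so the pruned layer matches the target.

    Solves min_m || residual + sum_i m_i * unit_outputs[i] - target ||^2
    over the kept entries only; pruned entries stay at 0.

    unit_outputs: (num_units, num_tokens, hidden) contribution of each unit
    residual:     (num_tokens, hidden) residual input of the pruned model
    target:       (num_tokens, hidden) output of the unpruned layer
    keep:         boolean array of shape (num_units,)
    """
    kept_idx = np.flatnonzero(keep)
    # Each kept unit becomes one column of the design matrix.
    design = unit_outputs[kept_idx].reshape(len(kept_idx), -1).T
    rhs = (target - residual).reshape(-1)
    coeffs, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    mask = np.zeros(unit_outputs.shape[0])
    mask[kept_idx] = coeffs
    return mask


rng = np.random.default_rng(0)
unit_outputs = rng.normal(size=(4, 8, 16))     # 4 units, 8 tokens, hidden size 16
residual = rng.normal(size=(8, 16))
target = residual + unit_outputs.sum(axis=0)   # unpruned layer: all masks are 1
keep = np.array([True, False, True, True])     # unit 1 was removed by the search


def recon_error(mask):
    """Distance between the masked layer output and the unpruned output."""
    output = residual + np.tensordot(mask, unit_outputs, axes=1)
    return np.linalg.norm(output - target)


binary_mask = keep.astype(float)
tuned_mask = tune_mask(unit_outputs, residual, target, keep)
print(recon_error(binary_mask), recon_error(tuned_mask))
```

The tuned mask can never reconstruct the target worse than the binary mask, because the binary mask is one feasible solution of the same least-squares problem. Any safeguards the paper applies to the solved values are not modeled here.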
This review was generated by AI and reviewed by a human editor.