QUICK REVIEW

[Paper Review] Prune Once for All: Sparse Pre-Trained Language Models

Ofir Zafrir, Ariel Larey | arXiv (Cornell University) | Nov 10, 2021

Topic Modeling | 26 citations

TL;DR

This paper introduces Prune Once for All (Prune OFA), an architecture-agnostic method to train sparse pre-trained Transformer language models by integrating weight pruning and distillation, enabling high sparsity (e.g., 85–90%) with minimal accuracy loss across downstream tasks, plus optional quantization.

ABSTRACT

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method with three known architectures to create sparse pre-trained BERT-Base, BERT-Large and DistilBERT. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8bit we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.

Motivation & Objective

Motivate the need for efficient, deployable Transformer LMs due to growing model sizes and environmental costs.
Propose an architecture-agnostic method (Prune OFA) to train sparse pre-trained LMs that retain transfer learning capabilities.
Show that sparse pre-trained models can be fine-tuned on multiple downstream tasks with minimal accuracy loss.
Demonstrate that subsequent quantization further reduces model size with modest accuracy impact, and release reproducible tooling and models.

Proposed method

Use unstructured weight pruning during a single pre-training/knowledge transfer process to obtain a sparse pre-trained LM.
Incorporate Gradual Magnitude Pruning (GMP) with Learning Rate Rewinding (LRR) and knowledge distillation (KD) during the pruning process.
Apply a pattern-lock mechanism to preserve the sparsity pattern during downstream fine-tuning.
Perform pre-training on English Wikipedia, then transfer to downstream tasks (SQuADv1.1, GLUE tasks) with KD to preserve performance.
Optionally apply Quantization-Aware Training (QAT) to obtain 8-bit quantized sparse models.
Provide an open-source compression library with scripts and sparse pre-trained models for reproducibility.
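
The pruning and pattern-lock steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation or their library's API: the function names are hypothetical, the cubic sparsity schedule is an assumption (a common choice for gradual magnitude pruning), and the random "gradients" stand in for real training. LRR and KD are left out to keep the example short.

```python
import random

def cubic_sparsity(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Target sparsity at `step`, rising from initial to final along a cubic curve (assumed schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered by absolute value, smallest first; already-pruned zeros sort to the front.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def locked_sgd_step(weights, grads, mask, lr):
    """Pattern-lock sketch: update, then re-apply the fixed mask so pruned weights stay zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Phase 1: gradual magnitude pruning toward 85% sparsity.
# A real run would interleave training updates between pruning steps.
total_steps = 10
for step in range(1, total_steps + 1):
    target = cubic_sparsity(step, total_steps, final_sparsity=0.85)
    mask = magnitude_mask(weights, target)
    weights = [w * m for w, m in zip(weights, mask)]

# Phase 2: downstream fine-tuning with the sparsity pattern locked.
for _ in range(5):
    grads = [random.gauss(0.0, 1.0) for _ in weights]  # placeholder gradients
    weights = locked_sgd_step(weights, grads, mask, lr=0.1)

sparsity = sum(1 for w in weights if w == 0.0) / len(weights)
print(f"sparsity after fine-tuning: {sparsity:.2f}")
```

The point of the second phase is that the mask is computed once and then only reused, so the fine-tuned model keeps exactly the zeros it had after pre-training.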

Experimental results

Research questions

RQ1: Can pruning during pre-training yield sparse pre-trained language models that transfer to downstream tasks with minimal accuracy loss?
RQ2: Does combining GMP, LRR, and KD during pruning improve transfer performance over task-specific pruning?
RQ3: Does preserving the sparsity pattern (pattern-lock) help maintain accuracy during fine-tuning?
RQ4: How does downstream quantization (8-bit QAT) affect the compression-to-accuracy trade-off for sparse pre-trained models?

Key findings

Model	Sparsity	Transfer with KD	SQuAD (EM/F1)	MNLI (m/mm)	SST-2	QNLI	QQP
Prune OFA (BERT-Base)	85%	Yes	78.59/86.63	81.67/82.53	91.34	89.95	
Prune OFA (BERT-Base)	85%	No	78.00/86.16	82.45/83.05	88.82	87.79	
Prune OFA (BERT-Base)	85%	Yes	81.10/88.42	82.71/83.67	91.46	90.34	

Prune OFA achieves high sparsity (85–90%) while maintaining competitive transfer performance on SQuADv1.1 and GLUE tasks compared to dense baselines and prior pruning methods.
Using KD during transfer improves results; combining KD with LRR and pattern-lock yields further gains, with minimal accuracy degradation on most tasks.
Quantization-aware training on sparse models reduces accuracy by a small margin (average ~0.67% relative to full-precision sparse models) and yields substantial size reductions, improving the compression-to-accuracy ratio.
For BERT-Large at 90% sparsity, results are within ~1% accuracy loss across most tasks, and even surpass the dense BERT-Base in terms of parameter efficiency (non-zero parameter count).
The authors release their compression library and sparse pre-trained models to facilitate reproducible research in model pruning and compression.
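
As a rough sanity check on the reported 40X figure, the arithmetic below combines sparsity and 8-bit quantization into one ratio. It is a back-of-the-envelope sketch, not the authors' accounting: it assumes a 32-bit floating-point dense baseline and ignores the storage needed for sparse indices, and `compression_ratio` is a hypothetical helper name.

```python
def compression_ratio(sparsity, dense_bits=32, quantized_bits=8):
    """Dense storage divided by sparse, quantized storage (index overhead ignored)."""
    kept_fraction = 1.0 - sparsity            # share of weights that remain non-zero
    bit_reduction = dense_bits / quantized_bits
    return bit_reduction / kept_fraction

for s in (0.85, 0.90):
    print(f"sparsity {s:.0%}: about {compression_ratio(s):.1f}x")
```

Under these assumptions, 90% sparsity contributes a factor of 10 and the move from 32 to 8 bits a factor of 4, which multiplies out to the 40X quoted for the BERT-Large encoder.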

This review was created by AI and reviewed by human editors.