[Paper Review] ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
This paper conducts a comprehensive PTQ study across model families (OPT and BLOOM) and sizes, compares weight-only, activation-only, and weight-and-activation quantization using RTN, GPTQ, and ZeroQuant variants, and introduces LoRC to recover model quality with a minimal increase in model size.
Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model families, and quantization bit precision has been absent from the literature. In this paper, we conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization using diverse methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these methods to two distinct model families with parameters ranging from 125M to 176B. Our contributions include: (1) a sensitivity analysis revealing that activation quantization is generally more susceptible to weight quantization, with smaller models often outperforming larger models in terms of activation quantization; (2) an evaluation and comparison of existing PTQ methods to optimize model size reduction while minimizing the impact on accuracy, revealing that none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation; (3) based on these insights, we propose an optimized method called Low-Rank Compensation (LoRC), which employs low-rank matrices to enhance model quality recovery with a minimal increase in model size.
Motivation & Objective
- Assess how PTQ behaves across model sizes and families under weight-only, activation-only, and weight-and-activation quantization.
- Evaluate existing PTQ methods (RTN, GPTQ, ZeroQuant variants) for their ability to shrink model size while preserving accuracy.
- Identify sensitivity patterns between activation and weight quantization across models and sizes.
- Propose improvements to PTQ with a low-rank compensation technique to recover FP16-quality performance.
- Provide practical quantization guidelines by model size group.
Proposed method
- Apply weight-only, activation-only, and weight-and-activation quantization using RTN, GPTQ, ZeroQuant, and variants to OPT and BLOOM models (125M to 176B).
- Perform sensitivity analysis on activation vs weight quantization, including symmetric/asymmetric quantization and per-row/per-token schemes.
- Compare PTQ methods under optimized configurations to maximize size reduction while minimizing perplexity degradation.
- Introduce LoRC (Low Rank Compensation) by factorizing quantization error E = W - W_hat with SVD into low-rank U and V to augment quantized weights.
- Demonstrate LoRC with FGQ (fine-grained quantization) and quantify parameter overhead; analyze optimal low-rank dimension m.
- Offer practical quantization recommendations by model size and quantization setting.
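The LoRC step listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the symmetric per-row RTN quantizer, the function names `rtn_quantize_per_row` and `lorc`, the random test matrix, and the choice to split the singular values evenly between the two factors are all assumptions made for the example.

```python
import numpy as np

def rtn_quantize_per_row(w, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row.

    Returns the dequantized weights W_hat (same shape and dtype family as w).
    """
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def lorc(w, w_hat, rank):
    """Factor the quantization error E = W - W_hat into two low-rank matrices."""
    err = w - w_hat
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])                        # assumption: split singular values evenly
    u_hat = u[:, :rank] * sqrt_s                      # shape (d_out, rank)
    v_hat = sqrt_s[:, None] * vt[:rank]               # shape (rank, d_in)
    return u_hat, v_hat

# Hypothetical weight matrix standing in for one linear layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)

w_hat = rtn_quantize_per_row(w, bits=4)
u_hat, v_hat = lorc(w, w_hat, rank=8)
w_lorc = w_hat + u_hat @ v_hat                        # compensated weights

err_before = np.linalg.norm(w - w_hat)                # Frobenius norm of E
err_after = np.linalg.norm(w - w_lorc)
extra_params = u_hat.size + v_hat.size                # rank * (d_out + d_in)
print(f"error before: {err_before:.3f}, after: {err_after:.3f}, extra params: {extra_params}")
```

The added storage is rank × (d_out + d_in) values per layer, which is why a small rank keeps the size increase minimal. On a random matrix like this one the error is close to unstructured noise, so the reduction is modest; how much LoRC helps on real model weights is an empirical question that the paper evaluates.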
Experimental results
Research questions
- RQ1: Do LLMs of different sizes and pretraining data exhibit similar behavior under quantization?
- RQ2: Are existing PTQ methods effectively minimizing LLM sizes without sacrificing accuracy?
- RQ3: How do weight-only, activation-only, and weight-and-activation quantization compare across model families (OPT and BLOOM)?
- RQ4: Can LoRC improve model quality recovery with minimal size increase when combined with FGQ and PTQ?
- RQ5: What practical quantization settings are recommended for different model sizes?
Key findings
- Activation quantization is generally more sensitive to weight quantization across models; smaller models often tolerate activation quantization better than larger models.
- Existing PTQ methods struggle to reach original model quality for INT4 weight or INT4 weight with INT8 activation (W4A8) quantization.
- LoRC improves model quality with minimal parameter overhead by approximating quantization error with low-rank matrices; gains are larger when combined with FGQ.
- GPTQ tends to perform best for weight-only quantization, while ZeroQuant variants generally perform better for weight-and-activation quantization.
- Fine-grained quantization (FGQ) substantially reduces error, enabling Class-1 performance for larger models (≥10B) with 4-bit weights; activation block size and model size influence gains.
- LoRC can nearly recover FP16 quality for INT4 quantization, with optimal gains at low ranks (m ≈ 4–8).
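To illustrate why fine-grained quantization reduces error, the sketch below quantizes the same matrix with one scale per block of consecutive weights in a row and compares several block sizes. It is a hedged example, not the paper's code: the function name `quantize_blockwise`, the symmetric RTN scheme, the block sizes, and the Gaussian test matrix are assumptions chosen for the demonstration.

```python
import numpy as np

def quantize_blockwise(w, bits=4, block_size=64):
    """Symmetric round-to-nearest quantization with one scale per block.

    Each row is split into consecutive blocks of `block_size` weights, and
    every block gets its own scale. block_size == w.shape[1] is per-row.
    """
    rows, cols = w.shape
    if cols % block_size != 0:
        raise ValueError("block_size must divide the number of columns")
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(rows, cols // block_size, block_size)
    scale = np.abs(blocks).max(axis=2, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero blocks
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return (q * scale).reshape(rows, cols)            # dequantized weights

# Hypothetical weight matrix; real layers have heavier-tailed distributions.
rng = np.random.default_rng(0)
w = rng.normal(size=(128, 1024))

errors = {
    bs: np.linalg.norm(w - quantize_blockwise(w, bits=4, block_size=bs))
    for bs in (1024, 256, 64)
}
for bs, err in errors.items():
    print(f"block size {bs:4d}: Frobenius error {err:.3f}")
```

Smaller blocks mean each scale only has to cover the largest value among fewer weights, so the quantization step shrinks and the rounding error falls. The cost is one extra scale per block, which is the storage trade-off to weigh when choosing a block size.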
This review was created by AI and reviewed by human editors.