[Paper Review] SpinQuant: LLM quantization with learned rotations
SpinQuant learns rotation matrices to improve post-training quantization of weights, activations, and KV caches in LLMs. With 4-bit quantization of all three on LLaMA-2 7B, it narrows the zero-shot accuracy gap to full precision to 2.9 points. The rotations are optimized with Cayley SGD on the Stiefel manifold and are inserted so that full-precision outputs stay unchanged.
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
Motivation & Objective
- Address the large quantization errors that outliers in weights and activations cause in LLMs.
- Introduce rotation-based invariant parameterizations that preserve full-precision outputs while enabling better quantization.
- Propose learning rotations via Cayley SGD on the Stiefel manifold to minimize quantized loss on a calibration set.
- Demonstrate improvements over existing PTQ methods across multiple LLaMA-2/3 model sizes and tasks.
- Show compatibility with GPTQ and robustness through ablations across rotation components.
Proposed method
- Parameterize rotations at multiple points in the transformer to reduce outliers without adding new parameters.
- Absorb R1, R2 into weights where possible to keep full-precision outputs identical.
- Use online Hadamard rotations where absorption is not possible: R3 on the KV-cache path in attention and R4 inside the feed-forward block.
- Optimize R1 and R2 on the Stiefel manifold using Cayley SGD to minimize quantization loss on a small calibration set.
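The two ideas in the list above, output-preserving rotations and orthogonality-preserving updates, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single linear layer instead of a full Transformer, a hypothetical toy loss (distance to a target rotation `T`) in place of the quantized network's loss on calibration data, and a plain Cayley-transform step without the momentum used in Cayley SGD. The helper names `random_orthogonal` and `cayley_step` are made up for this example.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is well spread

def cayley_step(R, G, lr):
    """One simplified Cayley-transform descent step (no momentum).

    R is the current orthogonal matrix, G is the Euclidean gradient dL/dR.
    The result is again orthogonal because the Cayley transform of a
    skew-symmetric matrix is orthogonal.
    """
    A = G @ R.T - R @ G.T                      # skew-symmetric direction
    I = np.eye(R.shape[0])
    # R_new = (I + lr/2 * A)^{-1} (I - lr/2 * A) R
    return np.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ R)

rng = np.random.default_rng(0)
d, h, tokens = 16, 32, 8
X = rng.standard_normal((tokens, d))           # activations entering a linear layer
W = rng.standard_normal((d, h))                # weights of that layer
R = random_orthogonal(d, rng)

# Rotational invariance: rotate the activations by R and absorb R^T into the
# weights. In full precision the layer output is unchanged.
Y_ref = X @ W
Y_rot = (X @ R) @ (R.T @ W)
invariance_gap = np.abs(Y_ref - Y_rot).max()

# Toy objective (an assumption for illustration only): pull R toward a target
# rotation T. SpinQuant instead minimizes the loss of the quantized network.
T = random_orthogonal(d, rng)
loss = lambda M: 0.5 * np.sum((M - T) ** 2)
G = R - T                                      # gradient of the toy loss at R
R_new = cayley_step(R, G, lr=0.01)
orth_gap = np.abs(R_new.T @ R_new - np.eye(d)).max()

print("max output difference after rotation:", invariance_gap)
print("max deviation from orthogonality after the step:", orth_gap)
print("toy loss before and after:", loss(R), loss(R_new))
```

The point of the Cayley form is that no re-orthogonalization is needed after the update, so the rotated network remains equivalent to the original one in full precision throughout optimization.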
Experimental results
Research questions
- RQ1: Can learned rotations on the residual, attention, and KV-cache paths reduce outliers sufficiently to improve 4-bit quantization of LLMs?
- RQ2: Does optimizing rotations with Cayley SGD on the Stiefel manifold yield consistent gains over random rotations and Hadamard rotations?
- RQ3: How does SpinQuant perform relative to state-of-the-art PTQ methods (e.g., GPTQ, SmoothQuant, QuaRot) across LLaMA-2/3 models and 4-bit settings?
- RQ4: Is the rotation-based approach compatible with existing quantization pipelines, and does it leave the full-precision network outputs unchanged?
- RQ5: What is the impact of individual rotation components (R1–R4) on quantization performance?
Key findings
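- Optimizing rotation matrices via Cayley SGD yields better quantization performance than random rotations across multiple models and tasks. Random rotations alone vary widely, with up to a 13-point difference in downstream zero-shot reasoning performance between draws.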
- In 4-bit W-A-KV quantization, SpinQuant reduces the gap to full precision to a few points (e.g., 2.9 points on LLaMA-2 7B) and outperforms QuaRot and SmoothQuant in zero-shot tasks.
- Rotation-based quantization improves both activation and weight quantization by distributing outliers more evenly, lowering quantization error.
- SpinQuant demonstrates strong improvements for hard-to-quantize models (LLaMA-3 8B/70B) and maintains compatibility with GPTQ.
- Ablation studies show that adding more of the rotations (R1–R4) generally improves accuracy, with the online rotation R4 providing notable gains for activation quantization.
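The outlier-spreading effect described in the findings above can be illustrated with a small NumPy experiment. This is a sketch under assumptions, not a reproduction of the paper's results: the activations are synthetic Gaussian data with one artificially inflated channel, the quantizer is a simple symmetric per-token 4-bit fake quantizer, and the rotation is a fixed normalized Hadamard matrix, not a learned one. The names `hadamard` and `fake_quant_per_row` are hypothetical helpers.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant_per_row(x, bits=4):
    """Symmetric per-row (per-token) quantize-then-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
tokens, d = 64, 64
X = rng.standard_normal((tokens, d))
X[:, 3] *= 50.0                                # synthetic outlier channel (assumption)

H = hadamard(d)

# Quantize directly: the outlier sets each row's scale, so the remaining
# channels are represented very coarsely.
err_plain = np.mean((fake_quant_per_row(X) - X) ** 2)

# Rotate, quantize, rotate back: the outlier's energy is spread across all
# channels, so the per-row range is used more evenly.
err_rot = np.mean((fake_quant_per_row(X @ H) @ H.T - X) ** 2)

print("mean squared error without rotation:", err_plain)
print("mean squared error with Hadamard rotation:", err_rot)
```

Because the rotation is orthogonal, the error measured after rotating back equals the error in the rotated space, which makes the two numbers directly comparable.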
This review was created by AI and reviewed by human editors.