QUICK REVIEW

[Paper Review] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Tim Dettmers, Ruslan Svirschevski|arXiv (Cornell University)|Jun 5, 2023

Topic Modeling|25 citations

TL;DR

SpQR introduces a hybrid sparse-quantized format that isolates a small set of high-sensitivity outlier weights and keeps them in higher precision, while quantizing all remaining weights to 3-4 bits. This enables near-lossless compression of large language models with efficient GPU decoding.

ABSTRACT

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables, for the first time, near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly accurate LLaMA and Falcon LLMs. This makes it possible to run a 33B-parameter LLM on a single 24 GB consumer GPU without any performance degradation and with a 15% speedup, thus making powerful LLMs available to consumers without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format and decoding them at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x.

Motivation & Objective

Investigate why standard low-bit quantization degrades quality for LLMs, especially in the 1-10B parameter range.
Propose a hybrid sparse-quantized representation that preserves accuracy by treating outliers separately.
Develop efficient encoding/decoding algorithms and a GPU-accelerated runtime for SpQR.
Demonstrate near-lossless compression (≤1% perplexity loss) across model scales from 7B to 65B parameters.
Evaluate memory and speed benefits compared with existing post-training quantization (PTQ) methods.

Proposed method

Identify outlier weights whose quantization induces disproportionately large output errors and store them at higher precision (16-bit).
Quantize the base weights in very small groups (β1 ≈ 8-32 weights per group), then collect the resulting per-group statistics into groups of their own (β2 ≈ 16) for a second level of quantization, giving a bilevel scheme.
Quantize base weights to 3-4 bits while encoding outliers separately in a CSR-like sparse format.
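Quantize the first-level statistics (per-group scales and zero-points) themselves with the same quantization procedure used for the weights (for example, 3-bit scales and zero-points for small groups), so that only the second-level statistics remain in higher precision.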
Use an extended PTQ approach inspired by GPTQ, organized as a two-step process: first detect outliers by their sensitivity (Eq. 2 in the paper) and quantize the base weights, then assemble the sparse outliers and metadata.
Provide a GPU decoding algorithm that combines dense dequantization to 16-bit with CSR-based outlier handling for token-by-token generation.
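
The storage idea in the steps above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the outlier rule here is a plain magnitude threshold (an assumption standing in for the paper's sensitivity criterion), the per-group statistics are left unquantized (no second level), and the names encode_matrix and decode_matrix are hypothetical.

```python
def encode_matrix(weights, bits=3, group_size=8, threshold=1.0):
    """Quantize inliers group-wise; keep outliers in CSR-like arrays."""
    levels = (1 << bits) - 1            # e.g. 7 for 3-bit codes
    dense = []                          # per row: list of (codes, scale, zero)
    row_ptr, col_idx, out_vals = [0], [], []
    for row in weights:
        groups = []
        for start in range(0, len(row), group_size):
            chunk = row[start:start + group_size]
            # Assumed outlier rule: magnitude threshold, NOT the paper's
            # sensitivity-based criterion.
            flags = [abs(w) > threshold for w in chunk]
            inliers = [w for w, f in zip(chunk, flags) if not f] or [0.0]
            zero = min(inliers)         # group offset (minimum inlier)
            scale = (max(inliers) - zero) / levels or 1.0
            codes = [0 if f else min(levels, max(0, round((w - zero) / scale)))
                     for w, f in zip(chunk, flags)]
            groups.append((codes, scale, zero))
            # Outliers bypass quantization and are stored exactly.
            for offset, (w, f) in enumerate(zip(chunk, flags)):
                if f:
                    col_idx.append(start + offset)
                    out_vals.append(w)
        dense.append(groups)
        row_ptr.append(len(col_idx))    # CSR row boundary
    return dense, row_ptr, col_idx, out_vals


def decode_matrix(encoded):
    """Dequantize the dense part, then overwrite the outlier positions."""
    dense, row_ptr, col_idx, out_vals = encoded
    rows = []
    for i, groups in enumerate(dense):
        row = [zero + scale * c for codes, scale, zero in groups for c in codes]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            row[col_idx[k]] = out_vals[k]
        rows.append(row)
    return rows


W = [[0.1, -0.2, 0.05, 3.0, 0.0, 0.15, -0.1, 0.2],
     [-0.3, 0.25, 0.1, 0.0, -2.5, 0.3, 0.2, -0.1]]
restored = decode_matrix(encode_matrix(W))
```

Because the two large values are excluded from each group's min/max range, the remaining weights share a much finer 3-bit grid than they would if the outliers stretched the range, which is the intuition behind treating outliers separately.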

Experimental results

Research questions

RQ1: Can SpQR achieve near-lossless compression (≤1% perplexity loss) while reducing model size to 3-4 bits per parameter for diverse LLMs?
RQ2: How does isolating outliers and using tiny-group bilevel quantization impact language modeling perplexity and zero-shot tasks compared with round-to-nearest (RTN) and GPTQ?
RQ3: What memory and compute benefits (e.g., speedups, memory footprint) does SpQR provide for GPU inference on large models?

Key findings

SpQR achieves relative perplexity losses of less than 1% on highly accurate LLaMA and Falcon models when quantized to 3-4 bits per parameter.
SpQR compresses LLMs by a factor of about 3.4x or more without accuracy degradation and can run 33B parameter models on a 24 GB GPU with ~15% speedup over 16-bit baselines.
Compared to GPTQ and RTN baselines at similar model sizes, SpQR yields substantially better perplexity and zero-shot performance, with improvements comparable in magnitude to the gains of GPTQ over RTN.
With 4-bit base quantization, SpQR closely matches or surpasses state-of-the-art baselines across the LLaMA and Falcon families, often halving the remaining error relative to the 16-bit baseline.
Outliers (about 1% of weights) are kept in 16-bit and stored in a CSR-like sparse structure, enabling efficient decoding on GPUs.
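
As a rough plausibility check on the memory figures above, the following back-of-the-envelope sketch estimates the average bits per parameter and the resulting weight footprint. The cost model is an assumption made for illustration: 16-bit second-level statistics, 32 bits per outlier (a 16-bit value plus a 16-bit index), a nominal 33e9 parameters, and weights only (no activations or KV cache). The function names are hypothetical.

```python
def avg_bits_per_param(weight_bits, stat_bits, group_size, stat_group_size,
                       outlier_rate, second_level_bits=16, outlier_bits=32):
    """Estimated storage cost per weight under the assumed cost model."""
    bits = weight_bits                                   # quantized code
    bits += 2 * stat_bits / group_size                   # scale + zero per group
    # Scales and zeros each get their own second-level scale and zero:
    # 4 higher-precision values per (group_size * stat_group_size) weights.
    bits += 4 * second_level_bits / (group_size * stat_group_size)
    bits += outlier_bits * outlier_rate                  # sparse outliers
    return bits


def weights_gb(num_params, bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 3-bit weights, 3-bit statistics, groups of 16, 1% outliers (illustrative).
bits = avg_bits_per_param(3, 3, 16, 16, 0.01)
print(f"average bits per parameter: {bits:.3f}")
print(f"16-bit footprint for 33e9 params: {weights_gb(33e9, 16):.1f} GB")
print(f"compressed footprint: {weights_gb(33e9, bits):.1f} GB")
```

Under these assumptions the metadata and outliers add roughly one bit per parameter on top of the 3-bit codes, which is why the effective size lands between 3 and 4 bits and why a 33B-parameter model's weights can drop well below 24 GB.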

This review was created by AI and reviewed by human editors.