QUICK REVIEW

[Paper Review] SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao|arXiv (Cornell University)|May 26, 2024
Advanced Data Compression Techniques | Cited by 6
One-sentence summary

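SpinQuant learns rotation matrices to improve post-training quantization of weights, activations, and the KV cache in large language models, substantially narrowing the gap to full-precision accuracy under 4-bit quantization of LLaMA-2/3 models. It optimizes the rotations on the Stiefel manifold with Cayley SGD while leaving the full-precision outputs unchanged.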

ABSTRACT

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.

Motivation and objectives

  • Address the large quantization errors that outliers in weights and activations cause in LLMs.
  • Introduce rotation-based invariant parameterizations that preserve full-precision outputs while enabling better quantization.
  • Propose learning rotations via Cayley SGD on the Stiefel manifold to minimize quantized loss on a calibration set.
  • Demonstrate improvements over existing PTQ methods across multiple LLaMA-2/3 model sizes and tasks.
  • Show compatibility with GPTQ and robustness through ablations across rotation components.
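
The invariance named in the second objective can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the random orthogonal matrix `R`, the toy linear layer `W`, and the injected outlier channel are illustrative assumptions. It rotates the activations, counter-rotates the weight, and compares both the layer output and the peak-to-RMS ratio of the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (assumed for illustration)

# Random orthogonal matrix from a QR decomposition (a stand-in, not a learned rotation).
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal((8, d))   # activations: tokens x hidden
x[:, 3] += 50.0                   # inject one outlier channel
W = rng.standard_normal((d, d))   # toy linear layer computing y = x @ W

# Rotate the activations and counter-rotate the weight: (x R)(R^T W) = x W.
x_rot = x @ R
W_rot = R.T @ W

# The full-precision output is unchanged up to floating-point error.
max_diff = np.max(np.abs(x @ W - x_rot @ W_rot))

def peak_to_rms(a):
    """Largest magnitude relative to the RMS value; large values indicate outliers."""
    return np.max(np.abs(a)) / np.sqrt(np.mean(a ** 2))

peak_plain = peak_to_rms(x)
peak_rot = peak_to_rms(x_rot)

print("max output difference:", max_diff)
print("peak-to-RMS before / after rotation:", peak_plain, peak_rot)
```

Because `R` is orthogonal, the RMS of the activations is preserved, so a lower peak-to-RMS ratio after rotation means the outlier energy has been spread across channels.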

Proposed method

  • Parameterize rotations at multiple points in the transformer to reduce outliers without adding new parameters.
  • Absorb R1, R2 into weights where possible to keep full-precision outputs identical.
  • Use online Hadamard rotations R3 (for the KV cache) and R4 (in the MLP block) where a rotation cannot be absorbed into the weights.
  • Optimize R1 and R2 on the Stiefel manifold using Cayley SGD to minimize quantization loss on a small calibration set.
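
The last step can be illustrated with a single Cayley-transform update, which keeps a matrix exactly orthogonal while moving it in a descent direction. This is a hedged sketch rather than the paper's optimizer: `cayley_step`, `outlier_loss`, and `outlier_grad` are hypothetical names, the fourth-power loss is a stand-in for the quantized network loss on a calibration set, momentum is omitted, and the step-halving rule is added only to keep the toy loop stable.

```python
import numpy as np

def cayley_step(R, G, lr):
    """One descent step that stays on the set of orthogonal matrices."""
    A = G @ R.T - R @ G.T                 # skew-symmetric direction built from the gradient
    I = np.eye(R.shape[0])
    # Cayley transform: solve (I + lr/2 A) R_new = (I - lr/2 A) R without an explicit inverse.
    return np.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ R)

def outlier_loss(x, R):
    """Stand-in objective (assumption): mean fourth power of the rotated activations."""
    return np.mean((x @ R) ** 4)

def outlier_grad(x, R):
    """Gradient of outlier_loss with respect to R."""
    return 4.0 * x.T @ (x @ R) ** 3 / x.size

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((32, d))
x[:, 3] += 20.0                           # toy calibration data with one outlier channel

R = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal initialization
lr = 1e-3
loss_start = outlier_loss(x, R)

for _ in range(200):
    candidate = cayley_step(R, outlier_grad(x, R), lr)
    if outlier_loss(x, candidate) < outlier_loss(x, R):
        R = candidate                     # accept the step
    else:
        lr *= 0.5                         # step too large for this toy loss; shrink it

loss_end = outlier_loss(x, R)
orth_err = np.max(np.abs(R.T @ R - np.eye(d)))
print("loss before / after:", loss_start, loss_end)
print("orthogonality error:", orth_err)
```

The Cayley transform of a skew-symmetric matrix is itself orthogonal, so `R` never needs to be re-projected after an update; that is the property that makes this family of updates a natural fit for learning rotations.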

Experimental results

Research questions

  • RQ1: Can learned rotations on the residual, attention, and KV-cache paths reduce outliers enough to improve 4-bit quantization of LLMs?
  • RQ2: Does optimizing rotations with Cayley SGD on the Stiefel manifold yield consistent gains over random rotations and Hadamard rotations?
  • RQ3: How does SpinQuant perform relative to state-of-the-art PTQ methods (e.g., GPTQ, SmoothQuant, QuaRot) across LLaMA-2/3 models and 4-bit settings?
  • RQ4: Is the rotation-based approach compatible with existing quantization pipelines, and does it leave the full-precision network outputs unchanged?
  • RQ5: What is the impact of individual rotation components (R1–R4) on quantization performance?

Key findings

  • Optimizing rotation matrices via Cayley SGD yields significantly better quantization performance than random rotations across multiple models and tasks.
  • In 4-bit W-A-KV quantization, SpinQuant reduces the gap to full precision to a few points (e.g., 2.9 points on LLaMA-2 7B) and outperforms QuaRot and SmoothQuant in zero-shot tasks.
  • Rotation-based quantization improves both activation and weight quantization by distributing outliers more evenly, lowering quantization error.
  • SpinQuant demonstrates strong improvements for hard-to-quantize models (LLaMA-3 8B/70B) and maintains compatibility with GPTQ.
  • Ablation studies show that adding more of the rotations (R1–R4) generally improves accuracy, with the online rotations (R3 for the KV cache, R4 in the MLP block) providing notable gains.
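
The claim that rotations lower quantization error by spreading outliers can be reproduced on toy data. The sketch below is an illustration under stated assumptions, not the paper's pipeline: `hadamard` and `fake_quant_int4` are hypothetical helpers, the quantizer is a plain symmetric per-row round-to-nearest scheme, and the activations are synthetic with one injected outlier channel.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def fake_quant_int4(a):
    """Symmetric per-row 4-bit quantize-dequantize with round-to-nearest."""
    scale = np.max(np.abs(a), axis=1, keepdims=True) / 7.0
    return np.clip(np.round(a / scale), -8, 7) * scale

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((8, d))
x[:, 3] += 50.0                        # one outlier channel dominates each row's scale

H = hadamard(d)

# Quantize directly in the original basis.
err_plain = np.mean((x - fake_quant_int4(x)) ** 2)

# Rotate, quantize, then rotate back; H is orthogonal, so H.T undoes the rotation.
x_deq = fake_quant_int4(x @ H) @ H.T
err_rot = np.mean((x - x_deq) ** 2)

print("4-bit MSE without rotation:", err_plain)
print("4-bit MSE with Hadamard rotation:", err_rot)
```

Without rotation, the outlier sets a large per-row scale and most of the ordinary channels round to zero; after the Hadamard rotation the outlier's energy is shared by all channels, so the same 16 levels cover a much narrower range.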


This review was generated by AI and reviewed by a human editor.