QUICK REVIEW

[Paper Digest] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Tim Dettmers, Ruslan Svirschevski|arXiv (Cornell University)|Jun 5, 2023

Topic Modeling | Cited by 25

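One-Sentence Summary

SpQR introduces a hybrid sparse-quantized format that isolates a small set of highly sensitive outlier weights, which are kept at higher precision, and quantizes the remaining weights to 3-4 bits, enabling near-lossless compression of large language models with efficient GPU decoding.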

ABSTRACT

Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especially for smaller models in the 1-10B parameter range, which are well-suited for edge deployments. To address this accuracy issue, we introduce the Sparse-Quantized Representation (SpQR), a new compressed format and quantization technique which enables for the first time near-lossless compression of LLMs across model scales, while reaching similar compression levels to previous methods. SpQR works by identifying and isolating outlier weights, which cause particularly-large quantization errors, and storing them in higher precision, while compressing all other weights to 3-4 bits, and achieves relative accuracy losses of less than 1% in perplexity for highly-accurate LLaMA and Falcon LLMs. This makes it possible to run 33B parameter LLM on a single 24 GB consumer GPU without any performance degradation at 15% speedup thus making powerful LLMs available to consumer without any downsides. SpQR comes with efficient algorithms for both encoding weights into its format, as well as decoding them efficiently at runtime. Specifically, we provide an efficient GPU inference algorithm for SpQR which yields faster inference than 16-bit baselines at similar accuracy, while enabling memory compression gains of more than 4x.

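Research Motivation and Goals

Investigate why standard low-bit quantization degrades the quality of large language models, especially in the 1B-10B parameter range.
Propose a hybrid sparse-quantized representation that preserves accuracy by handling outliers separately.
Develop efficient encoding and decoding algorithms, along with a GPU-accelerated runtime for SpQR.
Demonstrate near-lossless compression (perplexity loss ≤1%) across model scales from 7B to 65B parameters.
Evaluate the memory and speed advantages over existing post-training quantization (PTQ) methods.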

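Proposed Method

Identify the outlier weights whose quantization causes an unusually large increase in output error, and store them at higher precision (16 bits).
Quantize the base weights in very small groups (β1 ≈ 8-32) and apply a second level of quantization to the group statistics (β2 ≈ 16), giving bilevel quantization.
Quantize the base weights to 3-4 bits, and encode the outliers separately in a CSR-style sparse format.
Quantize the first- and second-level statistics with the same quantization procedure (3-bit scales and zero points for the small groups).
Use an extended, GPTQ-inspired PTQ method with a two-step procedure: sensitivity-based (Eq. 2) outlier detection and base-weight quantization, followed by assembly of the sparse outliers and metadata.
Provide a GPU decoding algorithm that combines dense 16-bit dequantization with CSR-based outlier handling for token-by-token generation.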
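
The steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it picks outliers by weight magnitude as a stand-in for the paper's sensitivity criterion (Eq. 2), uses plain min-max rounding instead of the GPTQ-style procedure, keeps only the dequantized second-level statistics rather than their packed codes, and runs on the CPU. The names `minmax_stats`, `quantize`, `bilevel`, `encode`, and `decode` are hypothetical.

```python
import numpy as np

def minmax_stats(g, bits):
    """Per-group scale and zero offset for asymmetric min-max quantization."""
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / (2 ** bits - 1), 1.0)
    return scale, lo

def quantize(g, scale, lo, bits):
    """Round each value to its nearest code in [0, 2**bits - 1]."""
    return np.clip(np.round((g - lo) / scale), 0, 2 ** bits - 1).astype(np.uint8)

def bilevel(stat, bits, group):
    """Second level: quantize the statistics themselves in groups of `group`
    and return them as the decoder would see them (dequantized)."""
    s = stat.reshape(-1, group)
    scale, lo = minmax_stats(s, bits)
    return (quantize(s, scale, lo, bits) * scale + lo).reshape(-1, 1)

def encode(W, bits=3, beta1=16, beta2=16, outlier_frac=0.01):
    rows, cols = W.shape
    assert cols % beta1 == 0 and (W.size // beta1) % beta2 == 0
    # Assumption: largest-magnitude weights stand in for sensitive outliers.
    k = max(1, int(outlier_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    mask = np.abs(W) >= thresh
    g, m = W.reshape(-1, beta1), mask.reshape(-1, beta1)
    # Fill outlier slots with the inlier mean so they do not stretch the range.
    n_in = np.maximum((~m).sum(axis=1, keepdims=True), 1)
    fill = np.where(m, 0.0, g).sum(axis=1, keepdims=True) / n_in
    base = np.where(m, fill, g)
    # First-level statistics per group of beta1 weights, then 3-bit second level.
    scale, lo = minmax_stats(base, bits)
    scale, lo = bilevel(scale, 3, beta2), bilevel(lo, 3, beta2)
    q = quantize(base, scale, lo, bits)
    # CSR-style outlier storage: row pointers, column indices, 16-bit values.
    row_ptr = np.concatenate(([0], np.cumsum(mask.sum(axis=1))))
    col_idx = np.nonzero(mask)[1]
    values = W[mask].astype(np.float16)
    return {"q": q, "scale": scale, "lo": lo, "shape": W.shape,
            "row_ptr": row_ptr, "col_idx": col_idx, "values": values}

def decode(enc):
    # Dense dequantization of the base weights.
    W = (enc["q"] * enc["scale"] + enc["lo"]).reshape(enc["shape"])
    # Overwrite outlier positions row by row from the CSR arrays.
    for r in range(enc["shape"][0]):
        s, e = enc["row_ptr"][r], enc["row_ptr"][r + 1]
        W[r, enc["col_idx"][s:e]] = enc["values"][s:e]
    return W
```

Calling `decode(encode(W))` on a matrix whose column count is a multiple of `beta1` returns a reconstruction in which the selected outliers are recovered at 16-bit precision and all other entries carry the 3-bit grouped quantization error. A real decoder would fuse the dense and sparse parts into the matrix-vector product instead of materializing the full weight matrix.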

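Experimental Results

Research Questions

RQ1: Can SpQR achieve near-lossless compression (≤1% perplexity loss) across diverse LLMs while reducing model size to 3-4 bits per parameter?
RQ2: How do outlier isolation and small-group bilevel quantization affect language-modeling perplexity and zero-shot task performance compared with RTN and GPTQ?
RQ3: What memory and compute benefits (for example speedup and memory footprint) does SpQR provide for GPU inference of large models?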

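Key Findings

On highly accurate LLaMA and Falcon models, SpQR achieves a relative perplexity loss below 1% when weights are quantized to 3-4 bits.
SpQR compresses LLMs by roughly 3.4x or more without loss of accuracy, and can run a 33B-parameter model on a 24 GB GPU about 15% faster than the 16-bit baseline.
Compared with GPTQ and RTN baselines at the same model size, SpQR is markedly better in perplexity and zero-shot performance, with an improvement on the order of GPTQ's gain over RTN.
With 4-bit base quantization, SpQR approaches or exceeds state-of-the-art baselines across the LLaMA and Falcon families, typically halving the error relative to the 16-bit baseline.
Outliers (about 1% of the weights) are kept at 16 bits and stored in a CSR-like sparse structure, which keeps GPU decoding efficient.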
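
As a rough check on how a 33B-parameter model can fit on a 24 GB GPU, the storage cost can be estimated with a back-of-envelope calculation. The accounting below is an assumption made for illustration, not the paper's exact formula: 3-bit base weights, 3-bit first-level scales and zero points per group of β1 = 16, a 16-bit scale and zero point for every β2 = 16 first-level statistics, and about 32 bits per outlier (16-bit value plus index) at a 1% outlier rate. The helpers `avg_bits` and `weights_gb` are hypothetical names.

```python
def avg_bits(b_w=3, b_stat=3, beta1=16, beta2=16, outlier_rate=0.01,
             second_level_bits=16, bits_per_outlier=32):
    """Estimated average storage per weight, in bits (illustrative accounting)."""
    # One quantized scale and one quantized zero point per group of beta1 weights.
    first_level = 2 * b_stat / beta1
    # Scales and zeros each need a 16-bit scale and zero per beta2 groups.
    second_level = 2 * 2 * second_level_bits / (beta1 * beta2)
    # Sparse outliers: value plus index, amortized over all weights.
    outliers = outlier_rate * bits_per_outlier
    return b_w + first_level + second_level + outliers

def weights_gb(n_params, bits_per_param):
    """Weight storage in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9
```

With these assumed defaults the estimate comes to roughly 3.9 bits per parameter, so `weights_gb(33e9, avg_bits())` is about 16 GB against 66 GB at 16 bits. The figure covers weights only and leaves out activations, the KV cache, and any layers kept unquantized.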

This digest was generated by AI and reviewed by a human editor.