QUICK REVIEW

[Paper Review] QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong|arXiv (Cornell University)|Oct 12, 2023

Topic Modeling | Cited by 11

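One-sentence summary

QLLM introduces adaptive channel reassembly and low-rank error correction so that post-training quantization of large language models stays accurate at 4 to 6 bits, outperforming prior methods in both zero-shot accuracy and efficiency.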

ABSTRACT

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.
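The outlier bottleneck described in the abstract can be illustrated with a minimal sketch. It assumes a plain per-tensor asymmetric min-max quantizer, which is an illustrative choice rather than the quantizer used in the paper; `fake_quantize` and the toy activation matrix are hypothetical names and data.

```python
import numpy as np

def fake_quantize(x, bits):
    """Per-tensor asymmetric min-max quantization followed by dequantization.

    Assumption: a simple uniform quantizer, used only to show the effect
    of an outlier channel on the quantization step size.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels          # step size grows with the value range
    q = np.round((x - lo) / scale)      # integer grid index in [0, levels]
    return q * scale + lo               # map back to floating point

# Toy activations: 8 tokens x 8 channels, all values in [-1, 1].
x = np.linspace(-1.0, 1.0, 64).reshape(8, 8)

# Same activations, but channel 0 is scaled up to act as an outlier channel.
x_outlier = x.copy()
x_outlier[:, 0] *= 50.0

err_plain = np.abs(fake_quantize(x, 4) - x).max()
err_outlier = np.abs(fake_quantize(x_outlier, 4) - x_outlier).max()
print(err_plain, err_outlier)
```

With a single outlier channel the shared step size is stretched to cover the outlier range, so the ordinary channels are rounded far more coarsely. That is the effect the channel reassembly in QLLM is meant to reduce.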

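Research motivation and goals

Reduce memory and compute requirements through quantization so that large language models can be deployed efficiently.
Address the activation outliers that limit PTQ accuracy at very low bitwidths.
Propose a gradient-free channel reassembly framework that redistributes outlier magnitudes across channels.
Introduce a gradient-based low-rank error correction mechanism to further reduce quantization error.
Demonstrate scalability and effectiveness on the LLaMA-1 and LLaMA-2 families across multiple tasks.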

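Proposed method

Adaptive channel reassembly, consisting of channel disassembly, which splits outlier channels into sub-channels, and channel assembly, which merges similar channels so that the original channel count is preserved.
An adaptive strategy that automatically determines the number of sub-channels for each layer by minimizing the reassembly error.
Gradient-based error correction using low-rank matrices A ∈ R^{M×r} and B ∈ R^{r×N}, added to each projection, trained on a small calibration set, and fused into the weights after training.
Iterative reconstruction of consecutive Attention-FFN blocks to slow the accumulation of quantization error.
An inference-friendly design: the reassembled channels can be folded back into the quantized weights without adding inference cost.
Progressive reconstruction carried out in sequence to account for the propagation of quantization error.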
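Channel disassembly and assembly can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer is taken to be `y = x @ W`, an outlier channel is split into `T` equal sub-channels that reuse its weight row, and two channels are merged by averaging their activations and summing their weight rows. The function names and the toy data are hypothetical, and the adaptive per-layer choice of `T` is not modelled.

```python
import numpy as np

def disassemble(x, W, channel, T):
    """Split input channel `channel` into T equal sub-channels.

    x: (tokens, M) activations, W: (M, N) weights, with y = x @ W.
    Each sub-channel carries x[:, channel] / T and reuses the same weight
    row, so the layer output is unchanged while the peak magnitude drops.
    """
    sub_x = np.repeat(x[:, [channel]] / T, T, axis=1)   # (tokens, T)
    sub_W = np.repeat(W[[channel], :], T, axis=0)       # (T, N)
    x_new = np.concatenate([np.delete(x, channel, axis=1), sub_x], axis=1)
    W_new = np.concatenate([np.delete(W, channel, axis=0), sub_W], axis=0)
    return x_new, W_new

def assemble(x, W, a, b):
    """Merge channels a and b into one channel.

    Assumption: the merged activation is the average of the two and the
    merged weight row is their sum. This is exact only when the two
    activations are identical and approximate otherwise.
    """
    merged_x = (x[:, [a]] + x[:, [b]]) / 2
    merged_W = W[[a], :] + W[[b], :]
    keep = [i for i in range(x.shape[1]) if i not in (a, b)]
    return (np.concatenate([x[:, keep], merged_x], axis=1),
            np.concatenate([W[keep, :], merged_W], axis=0))

# Toy layer: 2 tokens, 4 input channels, 3 outputs; channel 2 is the outlier.
x = np.ones((2, 4))
x[:, 2] = 40.0
W = np.arange(12, dtype=float).reshape(4, 3)

x_d, W_d = disassemble(x, W, channel=2, T=4)   # 4 channels become 7
x_a, W_a = assemble(x_d, W_d, 5, 6)            # merge two identical sub-channels
print(np.abs(x).max(), np.abs(x_d).max())      # peak magnitude before and after
```

Disassembly alone increases the channel count, which is why the method pairs it with assembly. In this toy case the merged channels are identical, so the merge is exact; merging channels that are only similar introduces an approximation error, which is the quantity the adaptive strategy above is described as minimizing.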

Figure 1: An illustration of the channel-wise maximum and minimum values for the input activations of a linear layer in LLaMA-65B for (a) the original pre-trained model, (b) after SmoothQuant (Xiao et al., 2023), and (c) after our channel reassembly.

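Experimental results

Research questions

RQ1: Can adaptive channel reassembly effectively suppress activation outliers and thereby improve PTQ accuracy for LLMs at 4 to 6 bits?
RQ2: How can the per-layer reassembly ratio be selected automatically to balance outlier suppression against information preservation?
RQ3: Can adding a small number of learnable low-rank weights improve the performance of quantized LLMs without incurring high training cost?
RQ4: Compared with existing PTQ methods, how does QLLM affect training and inference efficiency for large models?
RQ5: At low bitwidths, how does QLLM perform on zero-shot tasks and perplexity benchmarks for LLaMA-1 and LLaMA-2?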

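Key findings

At 4-bit quantization, QLLM clearly outperforms prior PTQ methods in zero-shot accuracy and perplexity, with the strongest results on medium and large models.
Quantizing LLaMA-1-65B to 4 bits with QLLM gives an average lead of 3.42 percentage points over OmniQuant across five zero-shot tasks.
For LLaMA-7B, QLLM's average zero-shot accuracy can exceed the QAT baseline (LLM-QAT + SQ) by 8.6%.
QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, which demonstrates strong efficiency.
Training-time comparisons show that QLLM needs substantially less time than OmniQuant in several configurations (for example, LLaMA-2-70B: 9.05 vs. 14.52 GPU hours).
Channel reassembly (disassembly plus assembly) with an adaptive threshold is close to lossless and outperforms earlier outlier-handling methods, especially at 4-bit quantization.
Low-rank error correction further reduces quantization error with a small number of trainable parameters and makes multi-block reconstruction feasible, which mitigates error accumulation.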
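The fusion of the low-rank correction into the weights can be checked numerically. The sketch below assumes the correction enters as `W + A @ B` with A ∈ R^{M×r} and B ∈ R^{r×N}; the random matrices stand in for the frozen quantized weights and the trained factors, and `fuse_low_rank` is a hypothetical helper name. Training of A and B is not shown.

```python
import numpy as np

def fuse_low_rank(W, A, B):
    """Fold the low-rank correction A @ B into the weight matrix W."""
    return W + A @ B

M, N, r = 6, 5, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(M, N))   # stand-in for the frozen quantized weights
A = rng.normal(size=(M, r))   # stand-in for trained low-rank factor A
B = rng.normal(size=(r, N))   # stand-in for trained low-rank factor B
x = rng.normal(size=(3, M))   # a small batch of input activations

# During tuning the correction runs as a separate low-rank branch.
y_branch = x @ W + (x @ A) @ B

# After tuning it can be merged, leaving a single matrix multiply.
W_fused = fuse_low_rank(W, A, B)
y_fused = x @ W_fused

extra_params = r * (M + N)    # trainable parameters in the low-rank branch
full_params = M * N           # parameters in the full weight matrix
print(np.allclose(y_branch, y_fused), extra_params, full_params)
```

Because the two forms give the same output, the extra branch disappears at inference time, and the trainable parameter count r(M + N) stays small relative to MN when r is much smaller than M and N.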

Figure A: An illustration of the searched expansion ratios using our adaptive strategy for 4-bit LLaMA-1-13B.


This review was generated by AI and reviewed by human editors.