QUICK REVIEW

[Paper Review] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Mengzhao Chen, Wenqi Shao|arXiv (Cornell University)|Jul 10, 2024

Topic Modeling | Cited by 6

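One-Sentence Summary

EfficientQAT introduces a two-phase quantization framework (Block-AP followed by E2E-QP) that compresses large language models efficiently, reaching strong 2-bit performance on a 70B model with only a small accuracy loss and reduced training memory.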

ABSTRACT

Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical due to substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points accuracy degradation compared to the full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.

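Research Motivation and Objectives

Address the high memory footprint and training cost of quantization-aware training (QAT) for large language models.
Develop a two-phase, memory-efficient QAT method that starts from a good initialization and trains only a minimal set of parameters in the end-to-end phase.
Demonstrate robustness and performance gains on base, instruction-tuned, and multimodal LLMs from 7B to 70B parameters.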

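Proposed Method

Proposes block-wise training of all parameters (Block-AP), which trains each transformer block in turn with a block-wise reconstruction objective, so that all parameters can be trained without retraining the whole LLM end to end.
Introduces end-to-end training of quantization parameters (E2E-QP), which freezes the quantized weights and optimizes only the quantization parameters (step sizes, and optionally zero points) end to end.
Quantizes weights with standard uniform quantization using a learnable s (scale, i.e. step size) and z (zero point), both included in the computation graph so that they can be optimized by gradient descent.
Shares s and z within each quantization group to reduce memory and the number of trainable parameters.
Shows that Block-AP provides a robust initialization and that E2E-QP further improves the quantized model on top of the quantized backbone.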
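The group-wise uniform quantization described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, s and z are set by a simple min-max rule rather than learned by gradient descent as in the paper, and the asymmetric form q = clamp(round(w / s) + z, 0, 2^b - 1) with dequantization s * (q - z) is the common textbook formulation, not code taken from the EfficientQAT repository.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int) -> Tuple[List[int], float, int]:
    """Quantize one group; returns (integer codes, step size s, zero point z).

    Assumption: s and z come from a min-max rule. In the paper they are
    learnable parameters; this rule only stands in for an initial value.
    """
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # A constant group would give s == 0, so fall back to 1.0 in that case.
    s = (w_max - w_min) / qmax or 1.0
    z = max(0, min(qmax, round(-w_min / s)))
    codes = [max(0, min(qmax, round(w / s) + z)) for w in weights]
    return codes, s, z


def dequantize_group(codes: List[int], s: float, z: int) -> List[float]:
    """Map integer codes back to real values: w_hat = s * (q - z)."""
    return [s * (q - z) for q in codes]


def quantize_tensor(
    weights: List[float], bits: int, group_size: int
) -> List[Tuple[List[int], float, int]]:
    """Split a flat weight list into groups; each group shares one (s, z) pair."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]
```

For example, `quantize_tensor([-1.0, -0.25, 0.75, 2.0, 0.0, 1.0, 2.0, 3.0], bits=2, group_size=4)` yields two groups that both use the codes `[0, 1, 2, 3]` with s = 1.0, but with zero points 1 and 0 respectively. Eight weights therefore carry only two (s, z) pairs, which is the saving that group sharing is meant to provide.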

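Experimental Results

Research Questions

RQ1: Does Block-AP provide an efficient and effective initialization for subsequent quantization-aware training of LLMs?
RQ2: Does end-to-end training of the quantization parameters (s and z) on a Block-AP-initialized backbone give better accuracy and efficiency at 2-, 3-, and 4-bit quantization?
RQ3: How does EfficientQAT compare with PTQ, QAT, and Q-PEFT baselines on base, instruction-tuned, and multimodal LLMs (7B to 70B)?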
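To make the setting behind RQ2 concrete, here is a toy sketch of training only a step size while the integer codes stay frozen. It is not the paper's training loop: it assumes a single linear unit, a squared-error loss, a hand-derived gradient, and hypothetical names (`forward`, `train_step_size`). The point is only that with q and z fixed, the dequantized weight s * (q - z) is linear in s, so s alone can be fitted by gradient descent.

```python
from typing import List


def forward(codes: List[int], z: int, s: float, x: List[float]) -> float:
    """Output of one linear unit whose weights are dequantized as s * (q - z)."""
    return sum(s * (q - z) * xi for q, xi in zip(codes, x))


def train_step_size(
    codes: List[int],
    z: int,
    s: float,
    inputs: List[List[float]],
    targets: List[float],
    lr: float = 0.1,
    steps: int = 50,
) -> float:
    """Fit only s by gradient descent on mean squared error; codes and z never change."""
    n = len(inputs)
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(inputs, targets):
            # a = d(output)/ds, because output = s * sum((q - z) * x_i).
            a = sum((q - z) * xi for q, xi in zip(codes, x))
            grad += 2.0 * (s * a - t) * a / n
        s -= lr * grad
    return s


# Toy data (illustrative only): targets are produced with a step size of 1.0,
# and training starts from a deliberately wrong step size of 0.8.
codes, z = [0, 1, 2, 3], 1
inputs = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0],
          [1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 1.0, 0.0]]
targets = [forward(codes, z, 1.0, x) for x in inputs]
learned_s = train_step_size(codes, z, 0.8, inputs, targets)
```

On this toy data the learned step size converges to 1.0. The only trainable value is a single float per group, which is why a scheme of this kind needs far less optimizer state than updating the weights themselves.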

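Key Findings

EfficientQAT performs strongly under low-bit quantization, including 2-bit quantization of Llama-2-70B with an accuracy drop of just under 3 points (from 72.41 to 69.48, a difference of 2.93 points).
Block-AP provides a robust initialization and, combined with E2E-QP, yields quantization results that surpass existing methods.
E2E-QP trains only the quantization parameters, which markedly lowers training memory (for example, 34.2 GB for the 2-bit 70B model on a single A100-80GB GPU).
At 2 to 4 bits, EfficientQAT outperforms established QAT and Q-PEFT baselines across base, instruction-tuned, and multimodal LLMs.
Uniform quantization also speeds up inference: the INT2 forward pass of linear layers in large models is accelerated by up to 4.4x.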
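As a back-of-the-envelope check on the memory findings above, the following sketch estimates the storage needed for group-quantized weights. Every setting in it is an assumption made for illustration and not taken from the paper: a nominal 70e9 parameters, a group size of 64, a 16-bit scale and a zero point stored at the weight bit width per group, and all layers quantized (in practice, embeddings and the output head are often kept in higher precision). The function names are hypothetical.

```python
def effective_bits(bits: int, group_size: int, scale_bits: int = 16, zero_bits: int = 2) -> float:
    """Average bits per weight, including the per-group scale and zero point."""
    return bits + (scale_bits + zero_bits) / group_size


def weight_bytes(n_params: float, bits: int, group_size: int,
                 scale_bits: int = 16, zero_bits: int = 2) -> float:
    """Total bytes needed to store n_params group-quantized weights."""
    return n_params * effective_bits(bits, group_size, scale_bits, zero_bits) / 8


# Nominal 70B-parameter model (assumed round number, not an exact count).
fp16_gb = 70e9 * 16 / 8 / 1e9                              # 140.0 GB in FP16
int2_gb = weight_bytes(70e9, bits=2, group_size=64) / 1e9  # about 19.96 GB at 2 bits
```

Under these assumptions the 2-bit weights take roughly 20 GB, compared with 140 GB in FP16. This estimate covers weight storage only; activations, gradients, and optimizer state for the trainable quantization parameters add to it during training.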

This review was generated by AI and reviewed by a human editor.