QUICK REVIEW

[Paper Review] QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

Yuhui Xu, Lingxi Xie | arXiv (Cornell University) | Sep 26, 2023

Topic Modeling | Cited by 21

One-Sentence Summary

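QA-LoRA pairs group-wise quantization of the base weights with a matching constraint on the adaptation weights, enabling low-bit fine-tuning and deployment. It keeps inference quantized and outperforms QLoRA followed by post-training quantization (PTQ).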

ABSTRACT

Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators that increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.
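The "imbalanced degrees of freedom" argument can be made concrete by counting parameters for one linear layer. This is a back-of-the-envelope sketch under assumed values: W is taken as D_in × D_out and quantized column by column, and the 4096 × 4096 shape, LoRA rank 64 and group size 32 are illustrative choices, not settings reported in this review. Only the scaling/zero factors and the LoRA matrices A and B are counted.

```python
D_IN, D_OUT, RANK = 4096, 4096, 64  # hypothetical layer shape and LoRA rank


def quant_params(num_groups: int) -> int:
    """Scaling + zero factors: one (alpha, beta) pair per group in every column."""
    return 2 * num_groups * D_OUT


def lora_params(rows_of_a: int) -> int:
    """LoRA parameters: A is rows_of_a x RANK, B is RANK x D_OUT."""
    return rows_of_a * RANK + RANK * D_OUT


# Column-wise quantization (one group per column) with standard LoRA (A is D_IN x RANK).
column_wise = (quant_params(1), lora_params(D_IN))

# Group-wise quantization with an assumed group size of 32, i.e. L = D_IN // 32 groups;
# A shrinks to L x RANK because its rows are shared inside each group.
L = D_IN // 32
group_wise = (quant_params(L), lora_params(L))

print("column-wise (quant, lora):", column_wise)
print("group-wise  (quant, lora):", group_wise)
```

Under these assumptions the column-wise setup has 8,192 quantization parameters against 524,288 LoRA parameters, while the group-wise setup has 1,048,576 quantization parameters and 270,336 LoRA parameters, so the balance shifts toward quantization as the abstract describes.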

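Motivation and Goals

Reduce the training and inference cost of large language models by combining quantization with parameter-efficient fine-tuning.
Propose a group-wise quantization strategy that increases the degrees of freedom of quantization while constraining those of adaptation.
Fine-tune low-bit quantized weights and merge the fine-tuned weights back into the quantized model for efficient deployment.
Demonstrate applicability to the LLaMA and LLaMA2 families across multiple datasets and bit widths.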

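Proposed Method

Introduces QA-LoRA, which partitions each column of W (along the input dimension D_in) into L groups and quantizes each group independently.
Constrains the LoRA adaptation by sharing row vectors within each group, so the adapter matrix A needs only one row per group, reducing the number of adaptation parameters.
During fine-tuning, quantizes W to a low-bit representation using scaling and zero factors (one pair per group rather than one per column), while adding the LoRA term s·A·B.
After fine-tuning, merges the adapted weights back into quantized form (W' = W̃ + s·A·B): because s·A·B is constant within each group, it can be absorbed into the group's zero factors, so no additional post-training quantization is needed.
Provides a PyTorch-like implementation that adds only a few lines of code to a standard LoRA/QLoRA pipeline.
Uses group-wise quantization to balance the degrees of freedom of quantization and adaptation, improving accuracy at low bit widths.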
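The method steps above can be sketched in NumPy. This is a minimal illustration under our own assumptions, not the authors' released code: min-max asymmetric 4-bit quantization, sum-pooling of the input within each group (average pooling multiplied by the group size is equivalent), and hypothetical function names (`quantize_groupwise`, `dequantize`, `qa_lora_forward`, `merge_into_zeros`) and sizes. It shows that because A has one row per group, s·A·B is constant inside each group and folds into the zero factors β, leaving the integer codes untouched.

```python
import numpy as np


def quantize_groupwise(W, group_size, bits=4):
    """Asymmetric min-max quantization of W (D_in x D_out).

    Each column is split into L = D_in // group_size groups, and every
    (group, column) pair gets its own scaling factor alpha and zero factor beta.
    """
    d_in, d_out = W.shape
    Wg = W.reshape(d_in // group_size, group_size, d_out)  # (L, g, D_out)
    w_min = Wg.min(axis=1, keepdims=True)
    w_max = Wg.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    alpha = np.maximum((w_max - w_min) / levels, 1e-8)  # (L, 1, D_out)
    beta = w_min  # (L, 1, D_out)
    q = np.clip(np.round((Wg - beta) / alpha), 0, levels).astype(np.uint8)
    return q, alpha, beta


def dequantize(q, alpha, beta):
    """Rebuild W~ = alpha * q + beta as a (D_in, D_out) matrix."""
    n_groups, group_size, d_out = q.shape
    return (alpha * q + beta).reshape(n_groups * group_size, d_out)


def qa_lora_forward(x, q, alpha, beta, A, B, s):
    """Frozen quantized path plus a LoRA path applied to group-summed inputs.

    A has one row per group (L x r), which is equivalent to a full D_in x r
    matrix whose rows are identical inside each group.
    """
    n_groups, group_size, _ = q.shape
    x_pooled = x.reshape(x.shape[0], n_groups, group_size).sum(axis=2)  # (n, L)
    return x @ dequantize(q, alpha, beta) + s * (x_pooled @ A) @ B


def merge_into_zeros(beta, A, B, s):
    """Fold s * A @ B into the zero factors; the integer codes q are unchanged."""
    return beta + (s * (A @ B))[:, None, :]


# Usage with hypothetical sizes: 256 inputs, 128 outputs, rank 8, group size 32.
rng = np.random.default_rng(0)
D_IN, D_OUT, RANK, GROUP = 256, 128, 8, 32
L = D_IN // GROUP
W = rng.standard_normal((D_IN, D_OUT))
q, alpha, beta = quantize_groupwise(W, GROUP)

A = 0.01 * rng.standard_normal((L, RANK))
B = 0.01 * rng.standard_normal((RANK, D_OUT))
s = 2.0
x = rng.standard_normal((4, D_IN))

y_train = qa_lora_forward(x, q, alpha, beta, A, B, s)
y_merged = x @ dequantize(q, alpha, merge_into_zeros(beta, A, B, s))
print(np.allclose(y_train, y_merged))  # expected: True
```

The final check compares the fine-tuning-time output with the merged model's output; only β changes while q stays as 4-bit codes, so under these assumptions the merged layer is still a quantized layer. A real INT4 deployment would also pack the codes and use dedicated kernels, which this sketch does not model.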

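Experimental Results

Research Questions

RQ1: When LLMs are fine-tuned under low-bit quantization, can quantization-aware low-rank adaptation maintain or improve accuracy?
RQ2: Does group-wise quantization increase the degrees of freedom of quantization enough to balance the adaptation during fine-tuning?
RQ3: How does QA-LoRA compare with LoRA and QLoRA (with and without PTQ) in accuracy and in inference/fine-tuning speed, across model sizes and datasets?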

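Key Findings

On MMLU and in zero-shot/few-shot settings, QA-LoRA consistently outperforms QLoRA with PTQ across model sizes and fine-tuning datasets.
Thanks to INT4 quantization and a quantized representation that is preserved after training, QA-LoRA fine-tunes and runs inference faster than QLoRA.
Compared with QLoRA without PTQ, QA-LoRA achieves competitive or higher accuracy while producing a quantized model directly, without a costly PTQ step.
The gains from QA-LoRA are more pronounced at lower bit widths (e.g., INT3 or INT2) and with smaller base models.
The method stays lightweight and easy to implement, requiring only a few code changes.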

This review was generated by AI and reviewed by human editors.