QUICK REVIEW

[Paper Review] QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong|arXiv (Cornell University)|Oct 12, 2023

Topic Modeling | Cited by 11

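One-Line Summary

QLLM introduces adaptive channel reassembly and low-rank error correction, enabling accurate post-training quantization of large language models at 4 to 6 bits and improving zero-shot accuracy and efficiency over prior methods.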

ABSTRACT

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.
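
The channel disassembly step described in the abstract can be illustrated with a minimal sketch. It assumes that an outlier input channel is split into T equal sub-channels and that the matching weight row is replicated T times; the toy values and the helper names (`linear`, `disassemble`, `quant_step`) are hypothetical and are not taken from the paper's code. The sketch covers disassembly only, not the later channel assembly that merges similar channels.

```python
def linear(x, W):
    """Compute y[j] = sum_i x[i] * W[i][j] for one token x and an (M, N) weight W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def disassemble(x, W, channel, T):
    """Split input channel `channel` into T sub-channels of value x/T and
    replicate its weight row T times, so the layer output is unchanged."""
    x_new = x[:channel] + [x[channel] / T] * T + x[channel + 1:]
    W_new = W[:channel] + [list(W[channel]) for _ in range(T)] + W[channel + 1:]
    return x_new, W_new


def quant_step(x, bits):
    """Step size of a symmetric per-tensor uniform quantizer over x."""
    return max(abs(v) for v in x) / (2 ** (bits - 1) - 1)


# Toy data (assumed for illustration): channel 2 carries the outlier.
x = [0.5, -1.0, 64.0, 0.25]
W = [[0.1, -0.2], [0.3, 0.4], [0.05, -0.01], [-0.6, 0.7]]

x_d, W_d = disassemble(x, W, channel=2, T=4)
y, y_d = linear(x, W), linear(x_d, W_d)

# Outputs agree up to floating-point rounding.
print(max(abs(a - b) for a, b in zip(y, y_d)))
# The 4-bit quantization step before and after disassembly.
print(quant_step(x, 4), quant_step(x_d, 4))
```

In this toy case the largest activation magnitude falls by a factor of T, so a per-tensor quantizer gets a proportionally finer step, at the cost of T - 1 extra channels. That cost is what channel assembly is meant to recover.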

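Motivation and Objectives

Motivate efficient deployment by reducing LLM memory and compute through quantization.
Address the activation outliers that limit PTQ accuracy in the extreme low-bitwidth regime.
Propose a gradient-free channel reassembly framework that redistributes outlier magnitudes across channels.
Introduce a gradient-based low-rank error correction mechanism that further reduces quantization error.
Demonstrate scalability and effectiveness on the LLaMA-1 and LLaMA-2 families across multiple tasks.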

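Proposed Method

Adaptive channel reassembly: channel disassembly splits outlier channels into sub-channels, and channel assembly merges similar channels to keep the original channel count.
An adaptive strategy that automatically determines the number of sub-channels for each layer by minimizing the reassembly error.
Efficient gradient-based error correction using low-rank matrices A ∈ R^{M×r} and B ∈ R^{r×N} added to each projection, trained on a small training set and merged into the weights after training.
Multi-block reconstruction that iteratively reconstructs consecutive Attention-FFN blocks to mitigate the accumulation of quantization error.
Inference-friendly design: the reassembled channels can be fused into the quantized weights at no extra inference cost.
Reconstruction proceeds step by step, taking the propagation of quantization error into account.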
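
Why the low-rank matrices add no inference cost once merged can be seen in a small sketch. It assumes the correction is applied as x(W + AB), with A of shape M×r and B of shape r×N as in the list above. The toy matrices, the helper names, and the rank used in the parameter count are hypothetical, and the sketch ignores any re-quantization of the fused weights.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def add(P, Q):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def lowrank_param_count(M, N, r):
    """Number of trainable values in A (M x r) and B (r x N)."""
    return r * (M + N)


# Toy shapes and values (assumed for illustration): M = 4, N = 3, r = 1.
W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 2.0, 1.0], [-1.0, 0.5, 0.5]]  # frozen weight
A = [[0.5], [-0.25], [0.0], [1.0]]   # low-rank factor, M x r
B = [[0.5, -1.0, 0.25]]              # low-rank factor, r x N
X = [[1.0, 2.0, -1.0, 0.5]]          # one input token, 1 x M

# During tuning: keep W frozen and add the low-rank branch.
y_unfused = add(matmul(X, W), matmul(matmul(X, A), B))

# After tuning: fold A @ B into W once, then run a single matmul.
W_fused = add(W, matmul(A, B))
y_fused = matmul(X, W_fused)

print(y_unfused, y_fused)
# Trainable values for a 4096 x 4096 projection at an assumed rank of 4,
# compared with the size of the full weight matrix.
print(lowrank_param_count(4096, 4096, 4), 4096 * 4096)
```

Because the fused weight has the same shape as the original, the deployed layer runs one matrix multiplication, while tuning touches only r(M + N) values per projection.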

Figure 1: An illustration of the channel-wise maximum and minimum values for the input activations of a linear layer in LLaMA-65B for (a) original pre-trained model, (b) after SmoothQuant (Xiao et al., 2023), and (c) after our channel reassembly.

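Experimental Results

Research Questions

RQ1: Can adaptive channel reassembly effectively suppress activation outliers to improve the PTQ accuracy of LLMs at 4 to 6 bits?
RQ2: How can the per-layer reassembly ratio be selected automatically to balance outlier suppression against information preservation?
RQ3: Can adding a small set of learnable low-rank weights improve the performance of a quantized LLM without heavy training cost?
RQ4: How does QLLM affect training and inference efficiency compared with prior PTQ methods for large models?
RQ5: How does QLLM perform on zero-shot tasks and perplexity benchmarks at low bitwidths on LLaMA-1 and LLaMA-2?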

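Key Findings

At 4-bit quantization, QLLM substantially outperforms prior PTQ methods in zero-shot accuracy and perplexity, most notably on larger models.
Quantizing LLaMA-1-65B to 4 bits with QLLM beats OmniQuant by an average of 3.42 points across five zero-shot tasks.
On LLaMA-7B, QLLM exceeds the QAT baseline (LLM-QAT + SQ) by 8.6% in average accuracy.
QLLM quantizes 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, demonstrating high efficiency.
In the training-time comparison, QLLM takes noticeably less time than OmniQuant in several configurations (e.g., LLaMA-2-70B: 9.05 vs. 14.52 GPU hours).
Reassembly (disassembly plus assembly) combined with adaptive thresholding improves on prior outlier-handling methods with nearly lossless performance, and is most effective at 4-bit quantization.
Low-rank error correction further reduces quantization error with a small number of trainable parameters and makes multi-block reconstruction possible, which mitigates error accumulation.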

Figure A: An illustration of the searched expansion ratios using our adaptive strategy for 4-bit LLaMA-1-13B.
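
How a threshold translates into an expansion ratio can be sketched as follows. The sketch assumes a channel whose largest magnitude is m is split into ceil(m / θ) sub-channels for a threshold θ, and that the expansion ratio is the number of extra channels divided by the original channel count. Both rules are assumptions made for illustration, since this review does not state the exact formula. The per-channel maxima and candidate thresholds are toy values, and the error-minimizing search itself is not shown.

```python
import math


def subchannel_counts(channel_max, theta):
    """Sub-channels per channel, assuming a ceil(max / theta) splitting rule."""
    return [max(1, math.ceil(m / theta)) for m in channel_max]


def expansion_ratio(channel_max, theta):
    """Extra channels created by disassembly, relative to the original count."""
    counts = subchannel_counts(channel_max, theta)
    return (sum(counts) - len(channel_max)) / len(channel_max)


# Toy per-channel maximum magnitudes (assumed); channels 2 and 7 are outliers.
channel_max = [0.5, 1.0, 64.0, 0.25, 2.0, 1.5, 0.75, 8.0]

# A lower threshold splits outliers more finely but adds more channels.
for theta in (64.0, 16.0, 8.0):
    print(theta, subchannel_counts(channel_max, theta), expansion_ratio(channel_max, theta))
```

An adaptive strategy of the kind described would evaluate candidates like these per layer and keep the one with the smallest reassembly error, rather than fixing a single threshold by hand.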


This review was generated by AI and checked by a human editor.