QUICK REVIEW

[Paper Review] ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

Xiaoxia Wu, Zhewei Yao|arXiv (Cornell University)|Jul 19, 2023

Parallel Computing and Optimization Techniques | Cited by 8

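One-Sentence Summary

This paper shows that, in post-training quantization of LLMs, FP8 activations and FP4 weights can outperform the conventional INT8/INT4 formats, that LoRC improves smaller models, and that the scaling-factor constraints cause almost no performance degradation.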

ABSTRACT

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly when dealing with outliers, and motivated by the launch of NVIDIA's H100 hardware, this study delves into the viability of floating-point (FP) quantization, particularly focusing on FP8 and FP4, as a potential solution. Our comprehensive investigation reveals that for LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100. To mitigate the overhead from precision alignment caused by the disparity between weights and activations, we propose two scaling constraints for weight quantization that negligibly impact the performance compared to the standard W4A8 model. We additionally enhance our quantization methods by integrating the Low Rank Compensation (LoRC) strategy, yielding improvements especially in smaller models. The results of our investigation emphasize the immense potential of FP quantization for LLMs, paving the way for high-efficiency deployment in resource-limited settings.

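Motivation and Objectives

Motivate and evaluate floating-point PTQ (FP8/FP4) for LLMs as a way to handle outliers and skewed distributions in activations.
Compare FP8/FP4 quantization against INT8/INT4 baselines across large language models (LLaMA and OPT) and datasets.
Investigate techniques that mitigate the weight/activation precision mismatch (scaling constraints, bit-shifting) and measure their effect on model quality.
Evaluate how effectively LoRC reduces quantization error, particularly in smaller models.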

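Proposed Method

Adopts GPTQ-based optimization for weight quantization and token-wise activation quantization in the FP8/FP4 settings.
Investigates FP8 versus INT8 activations and FP4 versus INT4 weight quantization across the LLaMA and OPT model families.
Introduces two scaling constraints for weight quantization (scales restricted to powers of 2) that ease casting from FP4 to FP8 and reduce overhead.
Implements LoRC to further reduce quantization error in smaller models.
Evaluates W4A8 casting strategies on hardware such as the NVIDIA H100 and proposes a bit-shifting method that aligns FP4 weights with FP8 activations.
Provides ablations and comparisons across datasets (WikiText-2, PTB, C4) and model sizes (LLaMA-3b to 30b, OPT-1.3b to 30b).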
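
To make the power-of-2 scaling idea above concrete, here is a minimal sketch of FP4 weight quantization with an optional power-of-2 scale. It is an illustration under stated assumptions, not the paper's implementation: the E2M1 magnitude grid, the helper names `pow2_scale` and `quantize_fp4`, and the choice to round the scale up to the next power of 2 are all assumptions, since this review does not spell out the exact format or the definitions of the two constraints.

```python
import math

# Assumed FP4 (E2M1) magnitude grid; the review does not specify the exact format.
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_E2M1_MAGNITUDES[-1]


def pow2_scale(scale):
    """Round a positive scale up to the nearest power of two (hypothetical helper)."""
    return 2.0 ** math.ceil(math.log2(scale))


def quantize_fp4(weights, constrain_pow2=False):
    """Fake-quantize a list of weights to the assumed FP4 grid.

    Returns (dequantized_weights, scale). With constrain_pow2=True the scale is
    a power of two, so rescaling only changes the floating-point exponent.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0
    scale = max_abs / FP4_MAX
    if constrain_pow2:
        # Rounding up keeps every scaled weight within [-FP4_MAX, FP4_MAX].
        scale = pow2_scale(scale)
    dequantized = []
    for w in weights:
        # Snap the scaled magnitude to the nearest representable grid value.
        mag = min(FP4_E2M1_MAGNITUDES, key=lambda g: abs(abs(w) / scale - g))
        dequantized.append(math.copysign(mag, w) * scale)
    return dequantized, scale


weights = [0.1, -0.35, 0.8, -1.2]
print(quantize_fp4(weights))
print(quantize_fp4(weights, constrain_pow2=True))
```

The appeal of a power-of-2 scale is that multiplying by it amounts to adding to the exponent field, which is why such a constraint pairs naturally with a bit-shift style cast from FP4 to FP8.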

Figure 1: Distribution of Activation values. The top, middle and bottom rows represents the distributions at the 2nd, 12th and final layer of the pretrained OPT-1.3b model. From the left to right columns, they are respectively for the linear modules attn.q_proj (same as attn.k_proj and attn.v_proj),

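Experimental Results

Research Questions

RQ1: Do FP8 activations consistently outperform INT8 activations, particularly as model size grows beyond 1B parameters?
RQ2: Can FP4 weight quantization match or exceed INT4, and how does LoRC affect this comparison?
RQ3: Do the two proposed scaling-constraint methods (power-of-2 scales) preserve model quality under FP4/FP8 W4A8 quantization?
RQ4: How does applying LoRC to W4A8 quantization affect models of different sizes?
RQ5: How do FP8/FP4 quantization schemes compare with standard INT-based quantization on standard LLM benchmarks and datasets?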

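Key Findings

FP8 activations generally outperform INT8 activations in both the LLaMA and OPT model families, with the gains most pronounced in larger models.
FP8 weights are competitive with INT8, and FP4 weights often outperform INT4, particularly in medium-to-large models.
LoRC improves W4A8 quantization and reduces quantization error, particularly in smaller models.
Constraining weight scales to powers of 2 (M1 or M2) has only a small effect on performance when LoRC is used, and M2 usually gives better results than M1.
Casting FP4 to FP8 under the power-of-2 scale strategy with LoRC allows a hardware-friendly implementation while maintaining performance.
Across datasets (WikiText-2, PTB, C4) and model sizes (LLaMA-3b to 30b, OPT-1.3b to 30b), FP8/FP4 configurations achieve lower perplexity than the corresponding INT-based configurations in many settings.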
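
The role of LoRC in the findings above can be illustrated with a small sketch. It assumes, as the name Low Rank Compensation suggests, that the quantization error E = W - W_hat is approximated by a low-rank term that is added back to the quantized weights. Everything here is illustrative: the grid quantizer `quantize_to_grid` is a hypothetical stand-in for a real FP4/INT4 quantizer, the rank is fixed to 1, and a plain power iteration takes the place of a truncated SVD.

```python
import math


def quantize_to_grid(w, step=0.5):
    """Hypothetical stand-in quantizer: round every entry to a multiple of `step`."""
    return [[round(x / step) * step for x in row] for row in w]


def error_norm(a, b):
    """Frobenius norm of the element-wise difference a - b."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def top_singular_triplet(e, iters=50):
    """Power iteration for the leading singular triplet (sigma, u, v) of matrix e."""
    rows, cols = len(e), len(e[0])
    # Fixed start vector for reproducibility; a random start is more robust.
    v = [1.0 / math.sqrt(cols)] * cols
    sigma, u = 0.0, [0.0] * rows
    for _ in range(iters):
        u = [sum(e[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        u_norm = math.sqrt(sum(x * x for x in u))
        if u_norm == 0.0:
            return 0.0, [0.0] * rows, v
        u = [x / u_norm for x in u]
        v = [sum(e[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v


def lorc_rank1(w, w_hat):
    """Add a rank-1 approximation of the error W - W_hat back onto W_hat."""
    e = [[x - y for x, y in zip(rw, rq)] for rw, rq in zip(w, w_hat)]
    sigma, u, v = top_singular_triplet(e)
    return [[w_hat[i][j] + sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


w = [[0.9, -0.4, 0.3], [0.2, 0.7, -0.6]]
w_hat = quantize_to_grid(w)
w_lorc = lorc_rank1(w, w_hat)
print("error before compensation:", error_norm(w, w_hat))
print("error after compensation: ", error_norm(w, w_lorc))
```

In practice the compensation would be stored as two thin factor matrices rather than folded into a dense weight, so the extra memory grows with the chosen rank instead of the full matrix size.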

Figure 2: A Contrast between INT8 and FP8 Quantization Methods. The top row displays the original vector in its full-precision form. The subsequent row showcases the vector after quantization through the INT8 Asymmetric approach. The final two rows present values quantized by the FP8 method, utilizi


This review was generated by AI and checked by a human editor.