[论文解读] ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
该论文表明在对大型语言模型进行后训练量化(PTQ)时,FP8 激活与 FP4 权重可超越传统的 INT8/INT4,LoRC 有助于小模型,缩放因子约束带来的损失微不足道。
In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly when dealing with outliers, and motivated by the launch of NVIDIA's H100 hardware, this study delves into the viability of floating-point (FP) quantization, particularly focusing on FP8 and FP4, as a potential solution. Our comprehensive investigation reveals that for LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100. To mitigate the overhead from precision alignment caused by the disparity between weights and activations, we propose two scaling constraints for weight quantization that negligibly impact the performance compared to the standard W4A8 model. We additionally enhance our quantization methods by integrating the Low Rank Compensation (LoRC) strategy, yielding improvements especially in smaller models. The results of our investigation emphasize the immense potential of FP quantization for LLMs, paving the way for high-efficiency deployment in resource-limited settings.
研究动机与目标
- Motivate and evaluate floating-point PTQ (FP8/FP4) for LLMs to address activation outliers and distribution skewness.
- Compare FP8/FP4 quantization against INT8/INT4 baselines across large language models (LLaMA and OPT) and datasets.
- Investigate techniques to mitigate W/A precision misalignment (scaling constraints, bit-shifting) and measure impacts on model quality.
- Assess the effectiveness of Low Rank Compensation (LoRC) in reducing quantization errors, especially for smaller models.
提出的方法
- Adopts GPTQ-based optimization for weight and token-wise activation quantization in a FP8/FP4 setting.
- Investigates FP8 activation versus INT8 activation, and FP4 versus INT4 weight quantization across LLaMA and OPT model families.
- Introduces two scaling constraints for weight quantization (powers of two) to ease FP4-to-FP8 casting and reduce overhead.
- Implements Low Rank Compensation (LoRC) to further reduce quantization errors especially in smaller models.
- Evaluates casting strategies for W4A8 on hardware like NVIDIA H100, proposing bit-shifting methods to align FP4 weights with FP8 activations.
- Provides ablation and comparative results across datasets (WikiText-2, PTB, C4) and model sizes.

实验结果
研究问题
- RQ1Does FP8 activation consistently outperform INT8 activation for LLMs, particularly as model size increases beyond 1B parameters?
- RQ2Can FP4 weight quantization match or beat INT4 in performance, and how does LoRC influence this comparison?
- RQ3Do two proposed scaling-constraint methods (power-of-two scales) preserve model quality in FP4/FP8 W4A8 quantization?
- RQ4What is the impact of applying LoRC on W4A8 quantization across model scales?
- RQ5How do FP8/FP4 quantization schemes compare to traditional INT-based quantization on standard LLM benchmarks and datasets?
主要发现
- FP8 activation generally outperforms INT8 activation for both LLaMA and OPT model families, with larger models showing more pronounced gains.
- FP8 weights are competitive with INT8, and FP4 weights can outperform INT4 in several cases, especially for mid-to-large models.
- LoRC improves W4A8 quantization, notably reducing quantization errors in smaller models.
- Constraining weight scales to powers of two (M1 or M2) has negligible impact on performance when LoRC is used, and M2 typically yields better results than M1.
- Casting FP4 to FP8 using a power-of-two scale strategy with LoRC helps maintain performance while enabling hardware-friendly implementations.
- Across datasets (WikiText-2, PTB, C4) and model sizes (LLaMA-3b to 30b, OPT-1.3b to 30b), FP8/FP4 configurations achieve lower perplexities than corresponding INT-based configurations in many configurations.

更好的研究,从现在开始
从论文设计到论文写作,大幅缩短您的研究时间。
无需绑定信用卡
本解读由 AI 生成,并经人工编辑审核。