[Paper Explained] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization
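BATQuant introduces a block-wise affine transformation, combined with Global and Private Kronecker decomposition and block-wise clipping, to quantize models to MXFP4 with minimal performance loss. It outperforms prior PTQ methods on both MLLMs and LLMs, particularly under aggressive low-bit settings.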
Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
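Research Motivation and Objectives
- Make 4-bit MXFP quantization of MLLMs/LLMs robust by addressing two failure modes of global rotations: outlier energy transferred across blocks, and bimodal activation distributions.
- Develop an affine transformation aligned with the quantization blocks, so that it prevents cross-block energy transfer while also learning to shape the distribution.
- Reduce parameter and runtime overhead with the Global and Private Kronecker (GPK) decomposition: P_i = B_i ⊗ A, where A is shared globally and B_i is private to block i.
- Demonstrate the effectiveness of BATQuant on multimodal and language tasks under aggressive MXFP configurations.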
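
To make the parameter saving behind P_i = B_i ⊗ A concrete, here is a small counting sketch. The hidden size (4096) and the factor shapes (A as 8×8, each B_i as 4×4) are illustrative assumptions, not values taken from the paper, and the comparison is against a plain dense block-diagonal P, so the numbers are not meant to reproduce the paper's reported reduction.

```python
def full_block_params(hidden: int, block: int = 32) -> int:
    """Parameters of a block-diagonal P with one dense (block x block) matrix per block."""
    assert hidden % block == 0
    return (hidden // block) * block * block


def gpk_params(hidden: int, block: int = 32, shared: int = 8) -> int:
    """Parameters when every P_i = B_i (x) A.

    A is (shared x shared) and stored once; each B_i is
    (block/shared x block/shared) and stored per block.
    The factor shapes are assumptions made for illustration.
    """
    assert hidden % block == 0 and block % shared == 0
    private = block // shared
    return shared * shared + (hidden // block) * private * private


if __name__ == "__main__":
    hidden = 4096  # assumed hidden size
    print(full_block_params(hidden))  # 131072
    print(gpk_params(hidden))         # 2112
```

Under these assumptions the shared factor A costs 64 parameters once, and each block adds only the 16 parameters of its private B_i.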
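Proposed Method
- Introduce the Block-wise Affine Transformation (BAT): a block-diagonal matrix P whose blocks match the MXFP granularity (for example, blocks of 32), which limits outlier propagation to within a block.
- Apply the Global and Private Kronecker (GPK) decomposition: P_i = B_i ⊗ A, with A shared globally and B_i private to each block, to reduce the parameter count.
- Introduce Block-wise Learnable Clipping, with one learnable threshold per block, to adapt to local statistics.
- Train the learnable parameters by minimizing the layer-wise quantization error on a calibration set: Θ_l^* = argmin_{Θ_l} E_{X∼D_cal} [ ||F_l(X) − F̂_l(X; Θ_l)||_2^2 ], where F_l denotes the full-precision layer l and F̂_l its quantized counterpart with parameters Θ_l.
- Integrate BATQuant into the Transformer: transformations are fused offline on the weight side and applied online on the activation side; some components are kept in BF16, while the GEMMs run in low bit-width.
- Use MXFP quantization with block size g=32, and align the block size of P with g so that each local distribution can be reshaped precisely.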
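
The sketch below illustrates why aligning the transformation with the quantization block matters. It is a simplified MXFP4-style quantize-dequantize routine, not the authors' implementation: each block shares one power-of-two scale and each element is rounded to the nearest FP4 (E2M1) magnitude. The helper names (`mxfp4_fake_quant`, `bat_fake_quant`) are hypothetical, rounding ties go to the smaller magnitude for simplicity, and the learnable clipping and the training of P_i are not modeled.

```python
import math

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
E2M1_EMAX = 2  # exponent of the largest power of two in the grid (4.0 = 2**2)


def mxfp4_fake_quant(block):
    """Quantize-dequantize one block using a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    exp = math.frexp(amax)[1] - 1
    scale = 2.0 ** (exp - E2M1_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])        # saturate at 6.0
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, v))
    return out


def matvec(m, v):
    """Plain matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def bat_fake_quant(x, transforms, block=32):
    """Apply one matrix P_i per block, then fake-quantize that block."""
    assert len(x) == block * len(transforms)
    out = []
    for i, p in enumerate(transforms):
        chunk = x[i * block:(i + 1) * block]
        out.extend(mxfp4_fake_quant(matvec(p, chunk)))
    return out


if __name__ == "__main__":
    # Tiny demo with block=4 and identity transforms (stand-ins for learned P_i).
    eye = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    x = [0.3, 0.3, 0.3, 0.3] + [0.3, 0.3, 0.3, 48.0]
    print(bat_fake_quant(x, [eye, eye], block=4))
    # [0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 48.0]
```

In the demo, the outlier 48.0 forces a large shared scale in its own block and flushes that block's small values to zero, while the other block is unaffected. A transformation that mixes values across blocks could carry that outlier into the clean block, which is the failure mode a block-diagonal P avoids by construction.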
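Experimental Results
Research Questions
- RQ1: Can BATQuant maintain high accuracy with MXFP4 under aggressive configurations such as W4A4KV16, and do existing methods degrade in that regime?
- RQ2: Does the block-wise affine transformation with GPK and clipping generalize across MLLMs and LLMs, and across multimodal and language tasks?
- RQ3: How do block-size alignment, the GPK configuration, and clipping affect quantization performance and parameter efficiency?
Key Findings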
| Bits | Method | MME | OCRBench | DocVQA | RealWorldQA | VLMBlind | Recovery(%) |
|---|---|---|---|---|---|---|---|
| W4A8KV16 | RTN | 2294 | 883 | 94.72 | 69.80 | 70.99 | 97.43 |
| W4A8KV16 | QuaRot | 2327 | 870 | 95.07 | 69.80 | 71.12 | 97.53 |
| W4A8KV16 | SpinQuant | 2321 | 872 | 94.79 | 70.46 | 69.82 | 97.29 |
| W4A8KV16 | BRQ | 2329 | 865 | 94.72 | 70.19 | 67.18 | 96.40 |
| W4A8KV16 | FlatQuant | 2351 | 886 | 95.31 | 69.02 | 73.90 | 98.66 |
| W4A8KV16 | SmoothQuant | 2349 | 885 | 94.81 | 70.06 | 69.46 | 97.61 |
| W4A8KV16 | GPTQ | 2346 | 891 | 95.03 | 69.15 | 72.62 | 98.36 |
| W4A8KV16 | BATQuant | 2386 | 893 | 95.55 | 70.20 | 73.14 | 99.29 |
| W4A4KV16 | RTN | 2243 | 838 | 92.70 | 65.23 | 66.47 | 93.07 |
| W4A4KV16 | QuaRot | 2189 | 810 | 93.47 | 64.97 | 57.62 | 89.69 |
| W4A4KV16 | SpinQuant | 1994 | 801 | 91.79 | 65.36 | 60.23 | 88.32 |
| W4A4KV16 | BRQ | 2147 | 805 | 92.94 | 66.14 | 62.14 | 90.74 |
| W4A4KV16 | FlatQuant | 2231 | 873 | 94.10 | 65.62 | 68.86 | 94.79 |
| W4A4KV16 | SmoothQuant | 2264 | 862 | 93.93 | 68.89 | 66.26 | 95.01 |
| W4A4KV16 | GPTQ | 2286 | 849 | 93.98 | 66.93 | 67.29 | 94.64 |
| W4A4KV16 | BATQuant | 2360 | 864 | 94.31 | 67.32 | 69.70 | 96.43 |
| W4A8KV8 | RTN | 2208 | 878 | 94.64 | 69.54 | 71.01 | 96.51 |
| W4A8KV8 | QuaRot | 2296 | 868 | 95.11 | 69.02 | 70.26 | 96.77 |
| W4A8KV8 | SpinQuant | 2217 | 832 | 94.41 | 68.10 | 69.04 | 94.58 |
| W4A8KV8 | BRQ | 2283 | 867 | 94.63 | 69.80 | 67.36 | 95.98 |
| W4A8KV8 | FlatQuant | 2353 | 888 | 95.12 | 69.14 | 72.77 | 98.41 |
| W4A8KV8 | SmoothQuant | 2317 | 884 | 94.72 | 70.19 | 68.91 | 97.19 |
| W4A8KV8 | GPTQ | 2340 | 885 | 95.14 | 71.11 | 71.79 | 98.53 |
| W4A8KV8 | BATQuant | 2368 | 890 | 95.47 | 69.93 | 72.82 | 98.89 |
| W4A8KV4 | RTN | 2220 | 856 | 94.05 | 68.50 | 67.50 | 94.76 |
| W4A8KV4 | QuaRot | 2280 | 857 | 94.66 | 68.52 | 68.36 | 95.65 |
| W4A8KV4 | SpinQuant | 2248 | 829 | 94.18 | 68.63 | 64.50 | 93.65 |
| W4A8KV4 | BRQ | 2236 | 841 | 94.07 | 68.63 | 66.03 | 94.20 |
| W4A8KV4 | FlatQuant | 2293 | 884 | 94.88 | 68.76 | 70.75 | 97.11 |
| W4A8KV4 | SmoothQuant | 2283 | 871 | 94.39 | 67.02 | 66.99 | 95.13 |
| W4A8KV4 | GPTQ | 2328 | 867 | 94.15 | 68.10 | 70.81 | 96.71 |
| W4A8KV4 | BATQuant | 2332 | 885 | 95.07 | 68.63 | 70.92 | 97.51 |
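
As a reading aid for the Recovery(%) column, the following sketch computes BATQuant's margin over the strongest baseline in each configuration. The numbers are copied from the table above; the helper name `margin_over_best_baseline` is made up for this illustration.

```python
RECOVERY = {
    "W4A8KV16": {"RTN": 97.43, "QuaRot": 97.53, "SpinQuant": 97.29, "BRQ": 96.40,
                 "FlatQuant": 98.66, "SmoothQuant": 97.61, "GPTQ": 98.36, "BATQuant": 99.29},
    "W4A4KV16": {"RTN": 93.07, "QuaRot": 89.69, "SpinQuant": 88.32, "BRQ": 90.74,
                 "FlatQuant": 94.79, "SmoothQuant": 95.01, "GPTQ": 94.64, "BATQuant": 96.43},
    "W4A8KV8": {"RTN": 96.51, "QuaRot": 96.77, "SpinQuant": 94.58, "BRQ": 95.98,
                "FlatQuant": 98.41, "SmoothQuant": 97.19, "GPTQ": 98.53, "BATQuant": 98.89},
    "W4A8KV4": {"RTN": 94.76, "QuaRot": 95.65, "SpinQuant": 93.65, "BRQ": 94.20,
                "FlatQuant": 97.11, "SmoothQuant": 95.13, "GPTQ": 96.71, "BATQuant": 97.51},
}


def margin_over_best_baseline(scores, method="BATQuant"):
    """Return (best baseline name, method score minus that baseline), in percentage points."""
    baselines = {name: s for name, s in scores.items() if name != method}
    best = max(baselines, key=baselines.get)
    return best, round(scores[method] - baselines[best], 2)


if __name__ == "__main__":
    for config, scores in RECOVERY.items():
        print(config, margin_over_best_baseline(scores))
```

For W4A4KV16 the strongest baseline by Recovery is SmoothQuant (95.01), so the margin over the best baseline is 1.42 points; the margin over FlatQuant (94.79) is 1.64 points.
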
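- BATQuant is nearly lossless under W4A8KV16, recovering about 99% of BF16 performance on the evaluated benchmarks (99.29% Recovery in the table above).
- Under W4A4KV16, BATQuant reaches an average recovery rate of 96.43% on the multimodal benchmarks, 1.64 percentage points above FlatQuant (94.79%).
- Under W4A8KV16, W4A8KV8, and W4A8KV4, BATQuant consistently outperforms the baselines on both MLLMs and LLMs, across multimodal and reasoning tasks. In the table above it has the highest Recovery in every configuration, although not the top score on every individual benchmark.
- The block-wise affine transformation prevents cross-block energy transfer and mitigates the bimodal distributions caused by Hadamard and other rotation-based methods.
- The GPK decomposition reduces the parameter count by roughly 74% to 79% or more relative to FlatQuant and a naive Kronecker decomposition, while keeping inference efficient by vectorizing the Kronecker product.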
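
The efficiency of a Kronecker-structured transform rests on a standard identity: (B ⊗ A) vec(X) = vec(B X Aᵀ) with row-major vec, so the dense matrix never has to be built. The sketch below checks this on a tiny example in plain Python. The function names are illustrative, and how BATQuant's kernels actually vectorize the product is not specified here.

```python
def kron(b, a):
    """Dense Kronecker product b (x) a for matrices given as lists of rows."""
    return [[bv * av for bv in brow for av in arow] for brow in b for arow in a]


def matmul(x, y):
    """Plain matrix product for lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(p * q for p, q in zip(row, v)) for row in m]


def kron_matvec(b, a, x):
    """Compute (b (x) a) @ x without forming the Kronecker product."""
    nb, na = len(b[0]), len(a[0])
    assert len(x) == nb * na
    xm = [x[j * na:(j + 1) * na] for j in range(nb)]  # reshape x to (nb x na), row-major
    a_t = [list(col) for col in zip(*a)]              # transpose of a
    ym = matmul(matmul(b, xm), a_t)                   # B X A^T
    return [v for row in ym for v in row]             # flatten row-major


if __name__ == "__main__":
    B = [[1, 2], [3, 4]]
    A = [[0, 1], [1, 0]]
    x = [1, 2, 3, 4]
    print(kron_matvec(B, A, x))   # [10, 7, 22, 15]
    print(matvec(kron(B, A), x))  # [10, 7, 22, 15]
```

As a rough cost comparison under assumed factor shapes (B_i as 4×4, A as 8×8, so P_i is 32×32), the dense product needs 1024 multiply-adds per block, while the factored form needs 128 for B X plus 256 for the multiplication by Aᵀ, or 384 in total.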
This explainer was generated by AI and reviewed by human editors.