QUICK REVIEW

[Paper Digest] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Ji-Fu Li, Manyi Zhang|arXiv (Cornell University)|Mar 17, 2026

Numerical Methods and Algorithms|Cited by 0

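One-Sentence Summary

BATQuant introduces a block-wise affine transformation, combined with Global and Private Kronecker (GPK) decomposition and block-wise learnable clipping, to quantize models to MXFP4 with minimal performance loss. It outperforms prior PTQ methods on both MLLMs and LLMs, especially under aggressive low-bit settings.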

ABSTRACT

Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

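Research Motivation and Objectives

Establish the case for robust 4-bit MXFP quantization of MLLMs/LLMs, addressing the failure of global rotations, which propagate outliers across blocks and produce bimodal distributions.
Develop a block-aligned affine transformation that prevents cross-block energy transfer while learning to shape the distribution.
Reduce parameter and runtime overhead through Global and Private Kronecker (GPK) decomposition: P_i = B_i ⊗ A, where A is shared globally and B_i is private to each block.
Demonstrate BATQuant's effectiveness on multimodal and language tasks under aggressive MXFP configurations.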
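
To make the parameter saving of P_i = B_i ⊗ A concrete, the following is a minimal counting sketch. The hidden dimension `d` and the factor sizes `a` (for the shared A) and `b` (for each private B_i) are illustrative assumptions, not values taken from the paper; the only constraint used is that a * b equals the block size g.

```python
def param_counts(d, g, a, b):
    """Count transform parameters for one layer of width d.

    Hypothetical setup: d is split into k = d / g blocks, A is a x a,
    each B_i is b x b, and a * b == g so that B_i (x) A is g x g.
    """
    assert a * b == g and d % g == 0
    k = d // g  # number of blocks
    return {
        # one dense g x g matrix per block
        "dense_blocks": k * g * g,
        # both Kronecker factors private to each block
        "private_kronecker": k * (a * a + b * b),
        # GPK: one shared A plus a private B_i per block
        "gpk": a * a + k * b * b,
    }


# Illustrative sizes only (assumed, not from the paper).
print(param_counts(d=4096, g=32, a=8, b=4))
```

Sharing A across blocks removes the per-block a x a term, so the saving grows with the number of blocks and with the size of the shared factor.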

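Proposed Method

Introduce the Block-wise Affine Transformation (BAT), which uses a block-diagonal matrix P aligned with the MXFP granularity (e.g., blocks of 32) to limit outlier propagation.
Apply Global and Private Kronecker (GPK) decomposition, P_i = B_i ⊗ A, with A shared globally and B_i private to each block, to reduce the parameter count.
Introduce Block-wise Learnable Clipping, with a learnable threshold per block, to adapt to local statistics.
Train the learnable parameters by minimizing the layer-wise quantization error on a calibration set: Θ_l^* = argmin_{Θ_l} E_{X∼D_cal} [ ||F_l(X) − F̂_l(X; Θ_l)||_2^2 ].
Integrate BATQuant into the Transformer: the transformation is fused offline on the weight side and applied online on the activation side, with BF16 used for some components alongside low-bit GEMMs.
Use MXFP quantization with block size g = 32, and align the block size of P with g so that local distributions can be reshaped precisely.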
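
The role of the block size g = 32 can be illustrated with a simplified MXFP4 fake-quantizer. This sketch assumes E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one shared power-of-two scale per block, chosen as 2^(floor(log2(max|x|)) − 2), which is my reading of the OCP Microscaling convention. It is not BATQuant's implementation, and rounding ties are resolved toward the smaller magnitude rather than to even.

```python
import math

FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes
E2M1_EMAX = 2  # exponent of the largest E2M1 binade (4.0 = 2**2)


def mx_scale(block):
    """Shared power-of-two scale for one block (assumed MX convention)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)


def quantize_block(block):
    """Quantize and dequantize one block with a shared scale."""
    scale = mx_scale(block)
    out = []
    for v in block:
        # Saturate at the largest representable magnitude (6.0).
        mag = min(abs(v) / scale, FP4_GRID[-1])
        q = min(FP4_GRID, key=lambda level: abs(level - mag))
        out.append(math.copysign(q, v) * scale)
    return out


def quantize_mxfp4(x, g=32):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for i in range(0, len(x), g):
        out.extend(quantize_block(x[i:i + g]))
    return out


# One outlier in the second block: it wipes out its own block's small
# values but leaves the first block untouched.
x = [1.0] * 32 + [0.1] * 31 + [100.0]
q = quantize_mxfp4(x)
print(q[0], q[32], q[63])  # 1.0 0.0 96.0
```

Because each block has its own scale, the damage from an outlier stays inside its block. A transformation that mixes values across blocks would spread that outlier into blocks that were previously well scaled, which is the failure mode the block-aligned P is meant to avoid.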

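Experimental Results

Research Questions

RQ1: Does BATQuant still maintain high accuracy under MXFP4 in aggressive configurations such as W4A4KV16? Do existing methods degrade in this setting?
RQ2: Does the block-wise affine transformation with GPK and clipping generalize across MLLMs and LLMs, and across multimodal and language tasks?
RQ3: How do block-size alignment, the GPK configuration, and clipping affect quantization performance and parameter efficiency?

Key Findings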

Bits	Method	MME	OCRBench	DocVQA	RealWorldQA	VLMBlind	Recovery(%)
W4A8KV16	RTN	2294	883	94.72	69.80	70.99	97.43
W4A8KV16	QuaRot	2327	870	95.07	69.80	71.12	97.53
W4A8KV16	SpinQuant	2321	872	94.79	70.46	69.82	97.29
W4A8KV16	BRQ	2329	865	94.72	70.19	67.18	96.40
W4A8KV16	FlatQuant	2351	886	95.31	69.02	73.90	98.66
W4A8KV16	SmoothQuant	2349	885	94.81	70.06	69.46	97.61
W4A8KV16	GPTQ	2346	891	95.03	69.15	72.62	98.36
W4A8KV16	BATQuant	2386	893	95.55	70.20	73.14	99.29
W4A4KV16	RTN	2243	838	92.70	65.23	66.47	93.07
W4A4KV16	QuaRot	2189	810	93.47	64.97	57.62	89.69
W4A4KV16	SpinQuant	1994	801	91.79	65.36	60.23	88.32
W4A4KV16	BRQ	2147	805	92.94	66.14	62.14	90.74
W4A4KV16	FlatQuant	2231	873	94.10	65.62	68.86	94.79
W4A4KV16	SmoothQuant	2264	862	93.93	68.89	66.26	95.01
W4A4KV16	GPTQ	2286	849	93.98	66.93	67.29	94.64
W4A4KV16	BATQuant	2360	864	94.31	67.32	69.70	96.43
W4A8KV8	RTN	2208	878	94.64	69.54	71.01	96.51
W4A8KV8	QuaRot	2296	868	95.11	69.02	70.26	96.77
W4A8KV8	SpinQuant	2217	832	94.41	68.10	69.04	94.58
W4A8KV8	BRQ	2283	867	94.63	69.80	67.36	95.98
W4A8KV8	FlatQuant	2353	888	95.12	69.14	72.77	98.41
W4A8KV8	SmoothQuant	2317	884	94.72	70.19	68.91	97.19
W4A8KV8	GPTQ	2340	885	95.14	71.11	71.79	98.53
W4A8KV8	BATQuant	2368	890	95.47	69.93	72.82	98.89
W4A8KV4	RTN	2220	856	94.05	68.50	67.50	94.76
W4A8KV4	QuaRot	2280	857	94.66	68.52	68.36	95.65
W4A8KV4	SpinQuant	2248	829	94.18	68.63	64.50	93.65
W4A8KV4	BRQ	2236	841	94.07	68.63	66.03	94.20
W4A8KV4	FlatQuant	2293	884	94.88	68.76	70.75	97.11
W4A8KV4	SmoothQuant	2283	871	94.39	67.02	66.99	95.13
W4A8KV4	GPTQ	2328	867	94.15	68.10	70.81	96.71
W4A8KV4	BATQuant	2332	885	95.07	68.63	70.92	97.51

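BATQuant achieves near-lossless performance under W4A8KV16, recovering up to 99% of BF16 performance on the evaluated benchmarks (99.29% in the table above).
Under W4A4KV16, BATQuant reaches an average recovery of 96.43% on multimodal benchmarks, 1.64 percentage points higher than FlatQuant (94.79%).
Under W4A8KV16, W4A8KV8, and W4A8KV4, BATQuant consistently outperforms the baselines on both MLLMs and LLMs, across multimodal and reasoning tasks.
The block-wise affine transformation prevents cross-block energy transfer and mitigates the bimodal distributions induced by Hadamard and other rotation-based methods.
GPK decomposition cuts the parameter count by 74%–79% or more relative to FlatQuant and naive Kronecker decomposition, while vectorized Kronecker products keep inference efficient.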
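
The efficiency of a Kronecker-structured transform rests on a standard identity: with row-major flattening, (B ⊗ A) vec(X) = vec(B X Aᵀ), so the full matrix never has to be formed. The sketch below checks this on small integer matrices. The function names and the column-vector convention are assumptions for illustration and may differ from the paper's layout.

```python
def kron(B, A):
    """Dense Kronecker product B (x) A as a list of rows."""
    return [[b_ik * a_jl for b_ik in b_row for a_jl in a_row]
            for b_row in B for a_row in A]


def matvec(M, x):
    """Plain dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def apply_kron(B, A, x):
    """Compute (B (x) A) x without materializing the Kronecker product."""
    b, a = len(B), len(A)
    # Reshape x into a b x a matrix X (row-major).
    X = [x[k * a:(k + 1) * a] for k in range(b)]
    # XA = X A^T  (b x a)
    XA = [[sum(X[k][l] * A[j][l] for l in range(a)) for j in range(a)]
          for k in range(b)]
    # Y = B (X A^T)  (b x a)
    Y = [[sum(B[i][k] * XA[k][j] for k in range(b)) for j in range(a)]
         for i in range(b)]
    # Flatten back to a vector (row-major).
    return [v for row in Y for v in row]


B = [[1, 2], [3, 4]]                      # private factor (2 x 2)
A = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]     # shared factor (3 x 3)
x = [1, 2, 3, 4, 5, 6]
print(matvec(kron(B, A), x) == apply_kron(B, A, x))  # True
```

For factors of size a and b, the dense product costs (ab)² multiplications per vector, while the factored form costs a²b + ab², and each step is a small matrix multiply that maps well onto batched kernels.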


This digest was generated by AI and reviewed by human editors.