QUICK REVIEW

[Paper Review] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Ji-Fu Li, Manyi Zhang | arXiv (Cornell University) | 2026-03-17

Numerical Methods and Algorithms | Citations: 0

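One-Line Summary

BATQuant quantizes models to MXFP4 using block-wise affine transformations that are parameterized by a Global and Private Kronecker (GPK) decomposition and paired with block-wise learnable clipping. Across MLLMs and LLMs it outperforms existing PTQ methods with minimal performance loss, and its advantage is largest in aggressive low-bit settings.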

ABSTRACT

Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition, which effectively reduces storage and runtime overhead, and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

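Motivation and Objectives

Motivate robust 4-bit MXFP quantization for MLLMs and LLMs, where global rotations fail because they transfer outliers across blocks and produce bimodal distributions.
Develop block-aligned affine transformations that learn to shape distributions while preventing cross-block energy transfer.
Reduce parameter and runtime overhead with the Global and Private Kronecker (GPK) decomposition, and add block-wise learnable clipping to suppress residual outliers.
Demonstrate BATQuant's effectiveness on multimodal and language tasks under aggressive MXFP configurations.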
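
To make the block-wise scaling behind this motivation concrete, here is a minimal NumPy sketch of MXFP4-style "fake" quantization (quantize, then dequantize). It assumes the common MX convention of 32-element blocks, a power-of-two shared scale, and the E2M1 magnitude grid. The `clip` argument is a hypothetical stand-in for a per-block clipping ratio; how BATQuant actually parameterizes and learns its thresholds is not specified in this review.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x, block=32, clip=None):
    """Quantize-dequantize x of shape (tokens, hidden) block by block.

    clip: optional array with one ratio in (0, 1] per block along `hidden`
          (hypothetical parameterization, used here only for illustration).
    """
    t, h = x.shape
    xb = x.reshape(t, h // block, block)
    amax = np.abs(xb).max(axis=-1, keepdims=True)
    if clip is not None:
        amax = amax * clip.reshape(1, -1, 1)      # shrink the range per block
    amax = np.maximum(amax, 1e-30)                # avoid log2(0)
    # Shared power-of-two scale: the largest E2M1 exponent is 2.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    mag = np.clip(np.abs(xb) / scale, 0.0, 6.0)   # saturate at the grid maximum
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(xb) * FP4_GRID[idx] * scale).reshape(t, h)

# Two blocks of ones, with a single outlier placed in the first block.
x = np.ones((1, 64))
x[0, 0] = 48.0

q_plain = mxfp4_fake_quant(x)
q_clip = mxfp4_fake_quant(x, clip=np.array([0.25, 1.0]))
```

In this toy input the outlier forces the first block's scale up to 8, so its remaining entries are rounded to zero, while the second block is reproduced exactly. With the illustrative clip ratio of 0.25 on the first block, the outlier saturates at 12 but the small entries are recovered. This is the trade-off that block-wise clipping is meant to tune.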

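Proposed Method

Introduce the Block-wise Affine Transformation (BAT): a block-diagonal matrix P aligned with the MXFP granularity (e.g., blocks of 32 elements) that confines outlier transfer to within each block.
Apply the Global and Private Kronecker (GPK) decomposition, P_i = B_i ⊗ A, where A is shared globally and B_i is private to block i, which reduces the parameter count.
Introduce block-wise learnable clipping, with a learnable threshold per block that adapts to local statistics.
Learn the parameters by minimizing the layer-wise quantization error on a calibration set: Θ_l^* = argmin_{Θ_l} E_{X∼D_cal} [ ||F_l(X) − F̂_l(X; Θ_l)||_2^2 ], where F_l is the full-precision output of layer l and F̂_l is its quantized counterpart.
Integrate BATQuant into the Transformer: weight-side transformations are fused offline and activation-side transformations are applied online, with selected components kept in BF16 and low-bit GEMMs used for the matrix multiplications.
Adopt MXFP quantization with block size g = 32 and align the block size of P with g, so that local distributions are reshaped precisely.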
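
The block-diagonal transform and its Kronecker factorization can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes a row-vector convention y = xP, a split of each 32-element block into factors of size g2 = 4 and g1 = 8, and that offline weight-side fusion means folding P^{-1} into the weight matrix. The names `gpk_apply` and `dense_blockdiag` are hypothetical.

```python
import numpy as np

def gpk_apply(x, A, Bs):
    """Compute x @ blockdiag(B_i kron A) without building the full matrix.

    x:  (tokens, hidden) activations, hidden = k * g2 * g1
    A:  (g1, g1) factor shared by every block ("global")
    Bs: (k, g2, g2) one factor per block ("private")
    """
    k, g2, _ = Bs.shape
    g1 = A.shape[0]
    X = x.reshape(-1, k, g2, g1)
    # Identity used: x_i @ kron(B_i, A) == (B_i.T @ X_i @ A).ravel()
    Y = np.einsum("bik,tbij,jl->tbkl", Bs, X, A)
    return Y.reshape(x.shape)

def dense_blockdiag(A, Bs):
    """Reference path: explicitly materialize the block-diagonal P."""
    k, g = Bs.shape[0], Bs.shape[1] * A.shape[0]
    P = np.zeros((k * g, k * g))
    for i in range(k):
        P[i * g:(i + 1) * g, i * g:(i + 1) * g] = np.kron(Bs[i], A)
    return P

rng = np.random.default_rng(0)
k, g2, g1 = 2, 4, 8                      # 2 blocks, each of size g2 * g1 = 32
A = np.eye(g1) + 0.05 * rng.standard_normal((g1, g1))
Bs = np.eye(g2) + 0.05 * rng.standard_normal((k, g2, g2))
x = rng.standard_normal((4, k * g2 * g1))
W = rng.standard_normal((k * g2 * g1, 16))

P = dense_blockdiag(A, Bs)
fast_matches_dense = np.allclose(gpk_apply(x, A, Bs), x @ P)

# Assumed form of offline fusion: W_fused = P^{-1} W, so (x P) W_fused == x W.
W_fused = np.linalg.solve(P, W)
output_preserved = np.allclose(gpk_apply(x, A, Bs) @ W_fused, x @ W)

dense_params = k * (g2 * g1) ** 2        # one dense 32 x 32 matrix per block
gpk_params = g1 * g1 + k * g2 * g2       # shared A plus one small B_i per block
```

For this toy size the factorized form stores 96 parameters against 2048 for dense blocks. The factor sizes are chosen for illustration only, so this ratio should not be read as the paper's reported savings.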

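Experimental Results

Research Questions

RQ1: Can BATQuant maintain high accuracy under MXFP4 and related configurations, including the aggressive W4A4KV16 setting, where existing methods degrade?
RQ2: Do block-wise affine transformations with GPK and clipping generalize across MLLMs and LLMs, covering both multimodal and language tasks?
RQ3: How do block-size alignment, the GPK configuration, and clipping affect quantization performance and parameter efficiency?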

Main Results

Bits	Method	MME	OCRBench	DocVQA	RealWorldQA	VLMBlind	Recovery (%)
W4A8KV16	RTN	2294	883	94.72	69.80	70.99	97.43
W4A8KV16	QuaRot	2327	870	95.07	69.80	71.12	97.53
W4A8KV16	SpinQuant	2321	872	94.79	70.46	69.82	97.29
W4A8KV16	BRQ	2329	865	94.72	70.19	67.18	96.40
W4A8KV16	FlatQuant	2351	886	95.31	69.02	73.90	98.66
W4A8KV16	SmoothQuant	2349	885	94.81	70.06	69.46	97.61
W4A8KV16	GPTQ	2346	891	95.03	69.15	72.62	98.36
W4A8KV16	BATQuant	2386	893	95.55	70.20	73.14	99.29
W4A4KV16	RTN	2243	838	92.70	65.23	66.47	93.07
W4A4KV16	QuaRot	2189	810	93.47	64.97	57.62	89.69
W4A4KV16	SpinQuant	1994	801	91.79	65.36	60.23	88.32
W4A4KV16	BRQ	2147	805	92.94	66.14	62.14	90.74
W4A4KV16	FlatQuant	2231	873	94.10	65.62	68.86	94.79
W4A4KV16	SmoothQuant	2264	862	93.93	68.89	66.26	95.01
W4A4KV16	GPTQ	2286	849	93.98	66.93	67.29	94.64
W4A4KV16	BATQuant	2360	864	94.31	67.32	69.70	96.43
W4A8KV8	RTN	2208	878	94.64	69.54	71.01	96.51
W4A8KV8	QuaRot	2296	868	95.11	69.02	70.26	96.77
W4A8KV8	SpinQuant	2217	832	94.41	68.10	69.04	94.58
W4A8KV8	BRQ	2283	867	94.63	69.80	67.36	95.98
W4A8KV8	FlatQuant	2353	888	95.12	69.14	72.77	98.41
W4A8KV8	SmoothQuant	2317	884	94.72	70.19	68.91	97.19
W4A8KV8	GPTQ	2340	885	95.14	71.11	71.79	98.53
W4A8KV8	BATQuant	2368	890	95.47	69.93	72.82	98.89
W4A8KV4	RTN	2220	856	94.05	68.50	67.50	94.76
W4A8KV4	QuaRot	2280	857	94.66	68.52	68.36	95.65
W4A8KV4	SpinQuant	2248	829	94.18	68.63	64.50	93.65
W4A8KV4	BRQ	2236	841	94.07	68.63	66.03	94.20
W4A8KV4	FlatQuant	2293	884	94.88	68.76	70.75	97.11
W4A8KV4	SmoothQuant	2283	871	94.39	67.02	66.99	95.13
W4A8KV4	GPTQ	2328	867	94.15	68.10	70.81	96.71
W4A8KV4	BATQuant	2332	885	95.07	68.63	70.92	97.51
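
As a quick arithmetic check on the table above, the following sketch computes BATQuant's margin in Recovery (%) over the strongest baseline in each configuration. The numbers are copied from the last column of the table; the helper name `margin_over_best_baseline` is hypothetical.

```python
# Recovery (%) column copied from the results table.
recovery = {
    "W4A8KV16": {"RTN": 97.43, "QuaRot": 97.53, "SpinQuant": 97.29, "BRQ": 96.40,
                 "FlatQuant": 98.66, "SmoothQuant": 97.61, "GPTQ": 98.36,
                 "BATQuant": 99.29},
    "W4A4KV16": {"RTN": 93.07, "QuaRot": 89.69, "SpinQuant": 88.32, "BRQ": 90.74,
                 "FlatQuant": 94.79, "SmoothQuant": 95.01, "GPTQ": 94.64,
                 "BATQuant": 96.43},
    "W4A8KV8": {"RTN": 96.51, "QuaRot": 96.77, "SpinQuant": 94.58, "BRQ": 95.98,
                "FlatQuant": 98.41, "SmoothQuant": 97.19, "GPTQ": 98.53,
                "BATQuant": 98.89},
    "W4A8KV4": {"RTN": 94.76, "QuaRot": 95.65, "SpinQuant": 93.65, "BRQ": 94.20,
                "FlatQuant": 97.11, "SmoothQuant": 95.13, "GPTQ": 96.71,
                "BATQuant": 97.51},
}

def margin_over_best_baseline(scores, ours="BATQuant"):
    """Return (best baseline name, gap in percentage points)."""
    baselines = {m: s for m, s in scores.items() if m != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours] - baselines[best], 2)

margins = {cfg: margin_over_best_baseline(s) for cfg, s in recovery.items()}
flatquant_gap_w4a4 = round(
    recovery["W4A4KV16"]["BATQuant"] - recovery["W4A4KV16"]["FlatQuant"], 2
)

for cfg, (best, gap) in margins.items():
    print(f"{cfg}: +{gap:.2f} points over {best}")
```

By this column, the closest baseline under W4A4KV16 is SmoothQuant (95.01), so the margin over the strongest baseline is 1.42 points, while the margin over FlatQuant is 1.64 points.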

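Under W4A8KV16, BATQuant is close to lossless, recovering about 99% of BF16 performance on the evaluated benchmarks (99.29% in the table above).
Under W4A4KV16, BATQuant reaches an average recovery of 96.43% on the multimodal benchmarks, 1.64 percentage points above FlatQuant.
BATQuant consistently outperforms the baselines under W4A8KV16, W4A8KV8, and W4A8KV4 across MLLMs and LLMs, on both multimodal and reasoning tasks.
The block-wise transformation prevents cross-block energy transfer and mitigates the bimodal distributions produced by Hadamard- and rotation-based methods.
The GPK decomposition reduces the parameter count by 74% to 79% or more relative to FlatQuant and a naive Kronecker parameterization, while keeping inference practical through the vectorized form of the Kronecker product.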
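
The cross-block energy transfer mentioned in these findings can be illustrated with a small NumPy sketch. It compares a global Hadamard rotation with a block-diagonal one on a vector that has a single outlier. The block-wise Hadamard is only a stand-in for a block-aligned transformation; BATQuant itself learns non-orthogonal affine blocks, which this sketch does not reproduce.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction, n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

hidden, block = 64, 32
x = np.zeros(hidden)
x[0] = 64.0                               # a single outlier, located in block 0

def block_amax(v):
    """Largest magnitude inside each quantization block."""
    return np.abs(v).reshape(-1, block).max(axis=1)

# Global rotation: mixes all 64 channels together.
y_global = x @ hadamard(hidden)

# Block-aligned rotation: each 32-element block is rotated independently.
H_b = hadamard(block)
y_block = np.concatenate(
    [x[i:i + block] @ H_b for i in range(0, hidden, block)]
)

print("global    :", block_amax(y_global))
print("block-wise:", block_amax(y_block))
```

With the global rotation, both blocks end up with a maximum magnitude of 8, so the block that originally held only zeros now needs a much larger scale. With the block-aligned rotation, the second block stays at zero and the outlier's energy remains inside its own block.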

This review was generated by AI and reviewed by a human editor.