QUICK REVIEW

[Paper Review] High-Rate Quantized Matrix Multiplication: Theory and Practice

Or Ordentlich, Yury Polyanskiy|arXiv (Cornell University)|Jan 23, 2026
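Stochastic Gradient Optimization Techniques | Cited by 0
One-Sentence Summary

The paper derives information-theoretic high-rate limits for quantized matrix multiplication in two settings, analyzes common quantizers, and introduces WaterSIC as a near-optimal scheme.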

ABSTRACT

This work investigates the problem of quantized matrix multiplication (MatMul), which has become crucial for the efficient deployment of large language models (LLMs). We consider two settings: 1) Generic MatMul, where both matrices must be quantized (weight+activation quantization); and 2) weight-only quantization, where the second matrix is only known through covariance matrix $Σ_X$ of its columns. For each setting, we first review the fundamental information-theoretic tradeoff between quantization rate and distortion (high-rate theory), and then analyze the performance of several popular quantization schemes, comparing them to these fundamental limits. Specifically, we discuss rate loss (compared to information theoretic optima) of absmax INT and floating-point (FP) quantization, for which we also derive remarkably accurate heuristic approximations. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. This new scheme (termed "WaterSIC") only uses scalar INT quantizers, but its high-rate performance is basis free (it depends only on the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations) and is within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit (!). GPTQ's performance is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find GPTQ to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal (for high-rate quantization).

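Research Motivation and Objectives

  • Motivate quantized MatMul as a bottleneck in LLM deployment and quantify the rate-distortion tradeoff.
  • Establish fundamental high-rate distortion limits for generic MatMul and weight-only quantization in terms of matrix statistics.
  • Evaluate common quantizers (absmax INT, FP, NVFP4) against the information-theoretic limits and derive practical approximations.
  • Introduce WaterSIC and show its near-optimal performance relative to the theoretical limits.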

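Proposed Method

  • Formulate a rate-R quantization scheme for A^T B with shared randomness and separate encoders for A and B.
  • Derive distortion bounds under the high-rate assumption for worst-case and average-case Gaussian inputs.
  • Analyze absmax INT, FP, and NVFP4 quantizers, obtaining an approximate distortion of the form K(i,j)·2·2^{-2R_eff}.
  • Develop a weight-only quantization framework that exploits the covariance Σ_X, and derive the optimal waterfilling-based rate allocation.
  • Propose WaterSIC, a per-coordinate rate allocation scheme whose distortion is within a constant factor of the information-theoretic optimum.
  • Relate WaterSIC to GPTQ/LDLQ and discuss the implications for basis dependence and rotation invariance.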
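The waterfilling-based rate allocation mentioned above can be illustrated with the classical reverse waterfilling rule for independent Gaussian components. The sketch below is not the paper's WaterSIC algorithm: the function name `reverse_waterfill` and the example variances are hypothetical, and treating the per-coordinate variances as given inputs (for instance, eigenvalues of Σ_X scaled by the weight variance) is an assumption made for illustration only.

```python
import math


def reverse_waterfill(variances, total_rate_bits, iters=200):
    """Classical reverse waterfilling (illustrative sketch, not WaterSIC).

    Each component i with variance v_i gets distortion min(theta, v_i) and
    rate max(0, 0.5 * log2(v_i / theta)); the water level theta is chosen so
    that the rates sum to total_rate_bits. All variances must be positive.
    """
    def rates_for(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    # Bracket theta: at `lo` every component spends at least total_rate_bits,
    # at `hi` (the largest variance) every component spends zero rate.
    lo = min(variances) * 2.0 ** (-2.0 * total_rate_bits)
    hi = max(variances)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a log scale
        if sum(rates_for(mid)) > total_rate_bits:
            lo = mid  # too much rate spent, so raise the water level
        else:
            hi = mid
    theta = hi
    distortions = [min(theta, v) for v in variances]
    return theta, rates_for(theta), distortions


# Hypothetical example: three components, 2 bits per component on average.
theta, rates, distortions = reverse_waterfill([4.0, 1.0, 0.25], total_rate_bits=6.0)
print(theta, rates, distortions)
```

When the rate is high enough that every component is active, the water level equals the geometric mean of the variances times 2^{-2R}, so it depends on the variances only through their product. This matches the abstract's remark that the high-rate performance depends only on the determinant of Σ_X.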
(a) Histogram of rounding errors

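Experimental Results

Research Questions

  • RQ1: What are the fundamental high-rate rate-distortion limits of generic quantized MatMul?
  • RQ2: In the generic MatMul setting, how do common quantization schemes (INT, FP, NVFP4) perform relative to these limits?
  • RQ3: In weight-only quantization, how does knowledge of Σ_X enable near-optimal rate allocation and distortion?
  • RQ4: Is there a practical scheme (WaterSIC) that achieves near-optimal distortion without complex vector quantization?
  • RQ5: How do rotations or the choice of basis affect the high-rate performance of practical quantizers such as GPTQ?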

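Key Findings

  • Under high-rate quantization, the minimum achievable distortion for entry (i,j) of A^T B scales as K(i,j)·2·2^{-2R}, where K(i,j) depends on the column norms.
  • For INT and FP quantizers, the distortion is approximately K(i,j)·2·2^{-2R_eff}; the rate gap R − R_eff characterizes the suboptimality.
  • In Σ_X-aware weight-only quantization, distortion approaches the waterfilling optimum; with WaterSIC the gap to the information-theoretic limit is about 0.25 bit per entry.
  • GPTQ/LDLQ performs close to WaterSIC in practice: with a random rotation and actual Σ_X from Llama-3-8B, GPTQ is within 0.1 bit of WaterSIC (depending on the layer type), suggesting it is also near optimal at high rates.
  • WaterSIC's distortion is within a multiplicative factor of 2πe/12 (≈0.25 bit/entry) of the information-theoretic limit, and it is invariant to the choice of basis for Σ_X.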
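The relation between a multiplicative distortion factor and a rate gap in bits follows directly from the form K(i,j)·2·2^{-2R}: a scheme whose distortion is c times the optimum at the same rate loses 0.5·log2(c) bits. The short sketch below checks the 2πe/12 figure numerically and inverts the distortion form to get an effective rate. The helper names `rate_gap_bits` and `effective_rate` and the example values are hypothetical, chosen only for illustration.

```python
import math


def rate_gap_bits(distortion_factor):
    """Rate loss R - R_eff (bits/entry) for a distortion that is
    `distortion_factor` times the optimum at the same rate."""
    return 0.5 * math.log2(distortion_factor)


def effective_rate(distortion, k):
    """Invert D = k * 2 * 2**(-2 * R_eff) to recover R_eff."""
    return -0.5 * math.log2(distortion / (2.0 * k))


# Gap implied by the multiplicative factor 2*pi*e/12 quoted for WaterSIC.
gap = rate_gap_bits(2 * math.pi * math.e / 12)
print(f"gap = {gap:.3f} bit/entry")

# Hypothetical check: with k = 1 and R = 4 the optimal distortion is 2 * 2**-8,
# and inverting it should return a rate of 4 bits.
r_eff = effective_rate(2.0 * 2.0 ** (-8), k=1.0)
print(r_eff)
```

The computed gap is slightly above 0.25, which is why the summary quotes the figure as approximate.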
(b) MSE as a function of scaling coefficient $\gamma$


This review was generated by AI and checked by human editors.