QUICK REVIEW

[Paper Review] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Sheng Shen, Zhen Dong|arXiv (Cornell University)|Sep 12, 2019

Topic Modeling | References: 42 | Cited by: 52

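One-Sentence Summary

This paper proposes Q-BERT, a Hessian-based mixed-precision and group-wise quantization scheme for BERT. On SST-2, MNLI, CoNLL-03, and SQuAD it compresses the weights by up to 13x with at most 2.3% performance degradation. The embedding and encoder layers use different quantization strategies, group-wise quantization further reduces the degradation, and SQuAD is the most challenging task.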
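The compression figures above can be sanity-checked with simple arithmetic. The sketch below assumes a 32-bit floating-point baseline and ignores the storage overhead of per-group quantization ranges; both are assumptions made for illustration, and the function names are hypothetical.

```python
def compression_ratio(baseline_bits, quantized_bits):
    """Size reduction factor when each value shrinks from baseline_bits to quantized_bits."""
    return baseline_bits / quantized_bits


def implied_average_bits(baseline_bits, ratio):
    """Average bits per value that a given compression ratio implies."""
    return baseline_bits / ratio


# Assumption: the uncompressed model stores every value in 32 bits.
print(compression_ratio(32, 8))                # 4.0, matches a 4x reduction
print(compression_ratio(32, 2))                # 16.0, upper bound for uniform 2-bit weights
print(round(implied_average_bits(32, 13), 2))  # 2.46, average bits implied by 13x
```

Under these assumptions, a 13x ratio implies an average of about 2.46 bits per weight, which is what one would expect when some layers use 2 bits and others use 3.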

ABSTRACT

Transformer based architectures have become de-facto models used for a range of Natural Language Processing tasks. In particular, the BERT based models achieved significant accuracy gain for GLUE tasks, CoNLL-03 and SQuAD. However, BERT based models have a prohibitive memory footprint and latency. As a result, deploying BERT based models in resource constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian based mix-precision method to compress the model further. We extensively test our proposed method on BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We can achieve comparable performance to baseline with at most $2.3\%$ performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding up to $13\times$ compression of the model parameters, and up to $4\times$ compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian based analysis as well as visualization, we show that this is related to the fact that current training/fine-tuning strategy of BERT does not converge for SQuAD.

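Motivation and Goals

Reduce BERT's memory footprint and latency for deployment on edge devices without incurring unacceptable accuracy loss.
Analyze fine-tuned BERT models using second-order Hessian information to guide quantization decisions.
Propose a Hessian-based mixed-precision scheme for the encoder layers and a group-wise quantization scheme for the self-attention modules.
Demonstrate substantial compression from ultra-low-precision quantization across multiple NLP tasks with minimal performance degradation.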

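Proposed Method

Compute a Hessian-based sensitivity for each encoder layer from the top eigenvalues of that layer's Hessian spectrum.
Define the sensitivity metric Omega_i = |mean(lambda_i)| + std(lambda_i), where lambda_i is the distribution of the top Hessian eigenvalues of layer i computed on 10% of the data.
Apply mixed-precision quantization by ranking the layers by Omega_i and assigning more bits to the more sensitive layers.
Introduce group-wise quantization, which partitions each dense matrix (for example, the matrices within each MHSA head) into groups and gives each group its own quantization range.
Apply different quantization schemes to the embedding and encoder parameters, followed by quantization-aware fine-tuning.
Use uniform 8-bit activations and compare against the DirectQ baseline to measure how much accuracy is retained.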
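The sensitivity ranking and bit assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the eigenvalue samples are made up, the absolute value of the mean is taken so the score stays non-negative, the population standard deviation is an assumed choice, and the `num_high` parameter (how many layers receive the higher precision) is hypothetical.

```python
import statistics


def sensitivity(top_eigenvalues):
    """Omega = |mean(lambda)| + std(lambda) over one layer's sampled top eigenvalues."""
    mean = statistics.fmean(top_eigenvalues)
    std = statistics.pstdev(top_eigenvalues)  # assumption: population std
    return abs(mean) + std


def assign_bits(eigs_per_layer, high_bits=3, low_bits=2, num_high=None):
    """Give high_bits to the most sensitive layers and low_bits to the rest."""
    omegas = [sensitivity(e) for e in eigs_per_layer]
    if num_high is None:
        num_high = len(omegas) // 2  # hypothetical default: top half of layers
    # Layer indices sorted from most to least sensitive.
    ranked = sorted(range(len(omegas)), key=lambda i: omegas[i], reverse=True)
    high = set(ranked[:num_high])
    return [high_bits if i in high else low_bits for i in range(len(omegas))]


# Made-up top-eigenvalue samples for four hypothetical encoder layers.
toy_eigs = [
    [1.0, 1.2, 0.9],   # layer 0: small, stable curvature
    [8.0, 12.0, 5.0],  # layer 1: large mean and large spread
    [6.0, 6.5, 7.0],   # layer 2: large mean, small spread
    [0.2, 0.1, 0.3],   # layer 3: nearly flat
]
bits = assign_bits(toy_eigs)
print(bits)  # [2, 3, 3, 2]
```

In this toy input, layers 1 and 2 have the largest scores and therefore receive 3 bits, while the flatter layers 0 and 3 are quantized to 2 bits.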

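Experimental Results

Research Questions

RQ1 How does Hessian information (the top eigenvalues and their distribution) correlate with the quantization sensitivity of BERT layers?
RQ2 Under ultra-low-bit quantization (2 to 4 bits), can mixed precision guided by Hessian analysis preserve BERT's accuracy?
RQ3 Does group-wise quantization improve performance when quantizing BERT's self-attention and feed-forward components?
RQ4 Which BERT modules (embeddings versus encoder layers) are most sensitive to quantization, and how should they be quantized?
RQ5 Why is SQuAD harder to quantize than the other NLP tasks?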

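Key Findings

Q-BERT achieves up to 13× weight compression and roughly 4× smaller embeddings and activations on SST-2, MNLI, CoNLL-03, and SQuAD, with at most 2.3% performance degradation.
Hessian-based mixed precision (2/3-bit or 2/4-bit) outperforms uniform 2-bit quantization, especially for deeper layers; the middle encoder layers are the most sensitive, while the last few layers are more robust.
Group-wise quantization (with 128 groups) substantially reduces accuracy loss compared with layer-wise quantization, with diminishing returns beyond a certain number of groups.
Embeddings are more sensitive to quantization than encoder weights, and the position embedding is especially critical for preserving performance.
SQuAD exhibits larger variance in its Hessian eigenvalues and negative curvature at the end of fine-tuning, which correlates with its larger accuracy loss at ultra-low precision.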
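Why group-wise quantization helps can be seen in a small experiment. The sketch below is an illustration under stated assumptions, not the paper's code: it uses plain min-max uniform quantization, forms groups from contiguous rows of a toy matrix, and all function names are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Uniform min-max quantization of a list of floats, returned de-quantized."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant group needs no quantization
    scale = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def groupwise_quantize(rows, bits, num_groups):
    """Split rows into contiguous groups; each group gets its own range."""
    group_size = -(-len(rows) // num_groups)  # ceiling division
    width = len(rows[0])
    out = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        flat = [v for row in group for v in row]
        deq = quantize_dequantize(flat, bits)
        out.extend(deq[i * width:(i + 1) * width] for i in range(len(group)))
    return out


def mse(a, b):
    """Mean squared error between two matrices given as lists of rows."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy matrix: the first two rows are tiny, the last two are large.
weights = [
    [0.01, -0.02, 0.03, -0.01],
    [0.02, 0.00, -0.03, 0.01],
    [1.0, -2.0, 1.5, -0.5],
    [2.0, 0.5, -1.0, -1.5],
]
layerwise_err = mse(weights, groupwise_quantize(weights, bits=2, num_groups=1))
groupwise_err = mse(weights, groupwise_quantize(weights, bits=2, num_groups=2))
print(layerwise_err > groupwise_err)  # True
```

With a single shared range, the small-magnitude rows are forced onto a grid sized for the large rows, so their values are badly rounded. Giving each group its own range removes most of that error at the cost of storing one range per group.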

This review was generated by AI and reviewed by a human editor.