QUICK REVIEW

[Paper Review] ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers

Zhewei Yao, Reza Yazdani Aminabadi|arXiv (Cornell University)|Jun 4, 2022

Advanced Neural Network Applications | Cited by 72

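One-Sentence Summary

ZeroQuant is an end-to-end post-training quantization (PTQ) pipeline that combines fine-grained weight and activation quantization, a lightweight layer-by-layer knowledge distillation (LKD) step, and an optimized backend to deliver INT8 and INT4/INT8 mixed-precision inference for large Transformer models, with minimal accuracy loss and significant speedups.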

ABSTRACT

How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained hardware-friendly quantization scheme for both weight and activations; (2) a novel affordable layer-by-layer knowledge distillation algorithm (LKD) even without the access to the original training data; (3) a highly-optimized quantization system backend support to remove the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision for weights and activations to INT8 in a cost-free way for both BERT and GPT3-style models with minimal accuracy impact, which leads to up to 5.19x/4.16x speedup on those models compared to FP16 inference; (2) ZeroQuant plus LKD affordably quantize the weights in the fully-connected module to INT4 along with INT8 weights in the attention module and INT8 activations, resulting in 3x memory footprint reduction compared to the FP16 model; (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, including GPT-J6B and GPT-NeoX20B, for which our INT8 model achieves similar accuracy as the FP16 model but achieves up to 5.2x better efficiency.

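Research Motivation and Objectives

Address the need to deploy ever-larger NLP models under memory and compute constraints.
Propose a post-training quantization pipeline that avoids retraining while preserving accuracy.
Introduce a hardware-friendly quantization scheme with fine-grained weight and activation quantization.
Propose a lightweight layer-by-layer knowledge distillation method that works without access to the original training data.
Demonstrate system-level optimizations that minimize quantization overhead during inference.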

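Proposed Method

Apply fine-grained, hardware-friendly quantization: group-wise weight quantization and token-wise activation quantization.
Introduce Layer-by-layer Knowledge Distillation (LKD), which quantizes the model one layer at a time, using the original unquantized layer as the teacher for each layer.
Develop a highly optimized inference backend that fuses token-wise quantization with the preceding operator to reduce data movement.
Use CUTLASS-based INT8 GeMM kernels and kernel fusion to minimize quantization/dequantization overhead.
Demonstrate INT8 weight/activation quantization on BERT and GPT-3-style models with minimal accuracy loss, as well as INT4/INT8 mixed precision with LKD.
Demonstrate scalability to GPT-J (6B) and GPT-NeoX (20B) with significant efficiency gains.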
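
The two fine-grained schemes named above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the symmetric max-abs scaling, the way each weight row is split into equal groups, and all function names are assumptions made for illustration, and the real system runs these steps inside fused GPU kernels rather than in NumPy.

```python
import numpy as np


def quantize_weights_groupwise(w, num_groups, bits=8):
    """Symmetric quantization with one scale per group of weights.

    Assumption: each row of `w` is split into `num_groups` equal groups.
    INT4 values (bits=4) are stored in an int8 array for simplicity.
    """
    rows, cols = w.shape
    assert cols % num_groups == 0, "columns must divide evenly into groups"
    qmax = 2 ** (bits - 1) - 1
    grouped = w.reshape(rows, num_groups, cols // num_groups)
    scale = np.abs(grouped).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(grouped / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize_groupwise(q, scale, shape):
    """Map group-wise integer codes back to floating point."""
    return (q.astype(np.float32) * scale).reshape(shape)


def quantize_activations_tokenwise(x, bits=8):
    """Dynamic symmetric quantization with one scale per token (row of x)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale


# Toy example: 4 tokens, hidden size 64, a linear layer with 8 output features.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 64))
x = rng.normal(size=(4, 64))

qw, sw = quantize_weights_groupwise(w, num_groups=4, bits=8)
qx, sx = quantize_activations_tokenwise(x, bits=8)

w_hat = dequantize_groupwise(qw, sw, w.shape)
x_hat = qx.astype(np.float32) * sx

# Compare the full-precision output with the output from quantized operands.
max_err = np.abs(x @ w.T - x_hat @ w_hat.T).max()
print("max abs output error:", max_err)
```

Because every group and every token gets its own scale, a single outlier only coarsens the quantization grid of its own group or token instead of the whole tensor, which is the motivation for the finer granularity.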

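Experimental Results

Research Questions

RQ1: Without access to training data, can post-training quantization bring large Transformer models down to INT8 or INT4 with minimal accuracy loss?
RQ2: Can layer-by-layer distillation (LKD) enable ultra-low-precision quantization without full retraining or access to the original data?
RQ3: Which hardware-friendly quantization strategies (group-wise weights, token-wise activations) give the best accuracy/latency trade-off on large-scale Transformers?
RQ4: How effective are system-level optimizations (kernel fusion, backend) at delivering low latency in practice for quantized Transformers?
RQ5: Can ZeroQuant scale to billion-parameter models and deliver significant throughput gains while keeping accuracy competitive?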

Key Findings

INT8 quantization of weights and activations for BERT and GPT-3-style models yields substantial speedups with minimal accuracy loss compared to FP16 (up to 5.19x on BERT-base and 4.16x on GPT-3-350M).
LKD enables INT4/INT8 mixed-precision quantization with ~3x memory footprint reduction versus FP16, at a modest accuracy cost and with a short quantization time (e.g., ~33s for BERT-base).
ZeroQuant applies directly to GPT-J-6B and GPT-NeoX-20B: the INT8 models reach accuracy similar to FP16 while delivering up to 5.2x better efficiency and lower GPU requirements and latency (e.g., GPT-NeoX-20B from 2 GPUs to 1, latency from 65ms to 25ms).
Kernel fusion and a CUTLASS-based INT8 GeMM backend substantially reduce quantization/dequantization overhead and improve latency for INT8 transformer inference.
On GPT-3-style models, accuracy-based tasks are more robust to quantization than generation tasks; at W8A8, ZeroQuant narrows the gap to FP16 relative to standard PTQ, and adding LKD improves the W4/8 mixed-precision setting.
Ablation studies show group-wise weight quantization plus token-wise activation quantization yields meaningful accuracy gains, further boosted by LKD.
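
The roughly 3x memory figure is consistent with a back-of-the-envelope count of weight storage in one Transformer layer. The count below is our own illustration, not taken from the paper, and it assumes a standard layer with hidden size $h$, four $h \times h$ attention projections, a feed-forward module with 4x expansion ($8h^2$ weights), and it ignores embeddings, biases, layer norms, and quantization scales.

```latex
\begin{align*}
M_{\text{FP16}} &= \underbrace{4h^2}_{\text{attention}} \cdot 2\,\text{B}
                 + \underbrace{8h^2}_{\text{feed-forward}} \cdot 2\,\text{B}
                 = 24h^2\ \text{bytes},\\
M_{\text{W4/8}} &= \underbrace{4h^2}_{\text{attention, INT8}} \cdot 1\,\text{B}
                 + \underbrace{8h^2}_{\text{feed-forward, INT4}} \cdot 0.5\,\text{B}
                 = 8h^2\ \text{bytes},\\
\frac{M_{\text{FP16}}}{M_{\text{W4/8}}} &= 3.
\end{align*}
```

Under these assumptions, pure INT8 weights would give $24h^2 / 12h^2 = 2$x, so the extra reduction comes from storing the larger feed-forward module in INT4.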

This review was generated by AI and reviewed by human editors.