QUICK REVIEW

[Paper Review] Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

Aishwarya Bhandare, Vamsi Sripathi|arXiv (Cornell University)|2019. 06. 03.

Topic Modeling | References: 14 | Citations: 57

One-Line Summary

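This paper uses INT8/VNNI instructions on Intel CPUs to quantize a trained Transformer translation model to 8-bit integers, combined with graph and pipeline optimizations, keeping the BLEU drop below 0.5 while achieving a net speedup of up to 1.5x over FP32.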

ABSTRACT

In this work, we quantize a trained Transformer machine language translation model leveraging INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining less than 0.5% drop in accuracy. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. This has high impact as it clearly demonstrates the various complexities of quantizing the language translation model. We present novel quantization techniques directly in TensorFlow to opportunistically replace 32-bit floating point (FP32) computations with 8-bit integers (INT8) and transform the FP32 computational graph. We also present a bin-packing parallel batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver 1.5X improvement over the best FP32 performance. Furthermore, it reveals the opportunities and challenges to boost performance of quantized deep learning inference and establishes best practices to run inference with high efficiency on Intel CPUs.

Research Motivation and Goals

Demonstrate quantization of a trained FP32 Transformer translation model to INT8 with minimal BLEU loss.
Investigate the impact of 8-bit quantization on Transformer self-attention and non-linear components.
Develop calibration and graph-optimization techniques to preserve accuracy while accelerating inference.
Enhance inference throughput through optimized MatMul kernels, data flows, and parallel batching on Intel CPUs.

Proposed Method

Replace FP32 MatMul operations with QuantizedMatMul using INT8/UINT8 operands and INT32 accumulation.
Calibrate quantization thresholds using KL-divergence to minimize distributional divergence between FP32 and INT8 tensors.
Tune Intel MKL/BLAS kernels to exploit VNNI instructions for accelerated 8-bit MatMuls.
Reorder and prune the computation graph to reduce redundant operations and overhead (e.g., GatherNd, dequantization).
Sort input sentences by token length and apply parallel batching to maximize CPU utilization during inference.
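
The first step in the list above (8-bit operands with INT32 accumulation) can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's TensorFlow/MKL implementation: the helper names `quantize`, `quantized_matmul`, and `dequantize` are hypothetical, the thresholds are fixed by hand rather than calibrated with KL-divergence, and treating the activations as UINT8 and the weights as INT8 is an assumption about which operand is unsigned.

```python
def quantize(values, threshold, signed):
    """Scale floats by qmax/threshold and clamp them to an 8-bit integer range.

    signed=True  -> symmetric INT8 range [-127, 127]
    signed=False -> UINT8 range [0, 255] (for non-negative tensors)
    """
    qmax = 127 if signed else 255
    qmin = -127 if signed else 0
    scale = qmax / threshold
    quantized = [
        [max(qmin, min(qmax, round(v * scale))) for v in row]
        for row in values
    ]
    return quantized, scale


def quantized_matmul(a_q, b_q):
    """Integer-only matrix product; Python ints stand in for the INT32 accumulator."""
    columns = list(zip(*b_q))
    return [
        [sum(x * y for x, y in zip(row, col)) for col in columns]
        for row in a_q
    ]


def dequantize(acc, scale_a, scale_b):
    """Map integer accumulators back to floats by undoing both scales."""
    return [[v / (scale_a * scale_b) for v in row] for row in acc]


# Toy operands (made up for illustration).
a = [[0.0, 0.5], [1.0, 0.25]]    # non-negative activations -> UINT8
b = [[1.0, -1.0], [0.5, 0.25]]   # weights -> INT8

a_q, scale_a = quantize(a, threshold=1.0, signed=False)
b_q, scale_b = quantize(b, threshold=1.0, signed=True)
approx = dequantize(quantized_matmul(a_q, b_q), scale_a, scale_b)

# FP32 reference product for comparison.
exact = [[0.25, 0.125], [1.125, -0.9375]]
max_error = max(
    abs(x - y) for row_x, row_y in zip(approx, exact) for x, y in zip(row_x, row_y)
)
```

In this toy case `max_error` stays well below 0.02, which shows why the choice of threshold matters: values beyond it are clamped, while a threshold that is too large wastes the 8-bit range.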

Experimental Results

Research Questions

RQ1: Can a trained Transformer translation model be quantized to INT8 with less than 0.5 BLEU score drop?
RQ2: Which calibration strategy and tensor distributions best preserve accuracy during INT8 quantization?
RQ3: What graph-level and system-level optimizations yield the most significant throughput gains for INT8/VNNI inference?
RQ4: How do MatMul and non-linear components (Softmax, LayerNorm) influence quantization viability in Transformer models?
RQ5: What is the achievable speedup of INT8/VNNI quantized inference compared with FP32 on Intel CPUs?

Key Results

INT8/VNNI quantization maintained BLEU drop under 0.5 points.
MKL INT8/GEMM kernels with VNNI achieved substantial MatMul speedups: up to 3.7x over FP32, and 2.3x over AVX512 for INT8 MatMul.
Calibration via KL-divergence with symmetric thresholding yielded best accuracy among modes tested (27.30–27.33 BLEU range).
Optimizations to GatherNd and operation fusion reduced data movement and execution time, improving throughput.
Sorting input by token count and parallel batching increased CPU utilization.
Overall, the approach achieved a net throughput improvement of up to 1.5x for the quantized model over the best FP32 performance.
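
The effect of sorting by token count before batching can be illustrated with a small stdlib-only sketch. It rests on several assumptions: whitespace splitting stands in for the real tokenizer, `translate_batch` is a hypothetical placeholder for the model, and a thread pool stands in for whatever parallel inference streams the paper actually uses. The names `make_batches`, `padded_tokens`, and `run_parallel` are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor


def token_count(sentence):
    # Whitespace split is a stand-in for the model's real tokenizer.
    return len(sentence.split())


def make_batches(sentences, batch_size, sort_by_length):
    """Return batches as lists of sentence indices, optionally length-sorted."""
    order = list(range(len(sentences)))
    if sort_by_length:
        order.sort(key=lambda i: token_count(sentences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_tokens(sentences, batches):
    """Tokens processed when every sentence is padded to its batch's longest."""
    return sum(
        max(token_count(sentences[i]) for i in batch) * len(batch)
        for batch in batches
    )


def run_parallel(sentences, batches, translate_batch, workers=2):
    """Run batches concurrently and restore the original sentence order."""
    results = [None] * len(sentences)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(
            lambda batch: translate_batch([sentences[i] for i in batch]), batches
        )
        for batch, translated in zip(batches, outputs):
            for index, text in zip(batch, translated):
                results[index] = text
    return results


sentences = [
    "hello",
    "this sentence has six tokens total",
    "good morning",
    "a slightly longer example sentence",
]

unsorted_cost = padded_tokens(sentences, make_batches(sentences, 2, False))
sorted_batches = make_batches(sentences, 2, True)
sorted_cost = padded_tokens(sentences, sorted_batches)

# Placeholder "model": upper-cases each sentence in the batch.
outputs = run_parallel(sentences, sorted_batches, lambda batch: [s.upper() for s in batch])
```

With these four sentences, grouping similar lengths together reduces the padded token count from 22 to 16, and `outputs` comes back in the original input order even though the batches were formed from the sorted order.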

This review was generated by AI and reviewed by a human editor.