QUICK REVIEW

[논문 리뷰] ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

Xiaoxia Wu, Zhewei Yao|arXiv (Cornell University)|2023. 07. 19.

Parallel Computing and Optimization Techniques인용 수 8

한 줄 요약

본 논문은 FP8 활성화와 FP4 가중치가 LLM의 사후 퀀타이제이션에서 전통적인 INT8/INT4를 능가할 수 있음을 보이며, LoRC가 소형 모델을 개선하고 스케일 제약으로 인한 손실이 무시할 만하다는 점을 시사한다.

ABSTRACT

In the complex domain of large language models (LLMs), striking a balance between computational efficiency and maintaining model quality is a formidable challenge. Navigating the inherent limitations of uniform quantization, particularly when dealing with outliers, and motivated by the launch of NVIDIA's H100 hardware, this study delves into the viability of floating-point (FP) quantization, particularly focusing on FP8 and FP4, as a potential solution. Our comprehensive investigation reveals that for LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion. For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100. To mitigate the overhead from precision alignment caused by the disparity between weights and activations, we propose two scaling constraints for weight quantization that negligibly impact the performance compared to the standard W4A8 model. We additionally enhance our quantization methods by integrating the Low Rank Compensation (LoRC) strategy, yielding improvements especially in smaller models. The results of our investigation emphasize the immense potential of FP quantization for LLMs, paving the way for high-efficiency deployment in resource-limited settings.

연구 동기 및 목표

LLMs를 위한 부동소수점 PTQ(FP8/FP4)를 동기부여하고 평가하여 활성화 이상치와 분포 편향을 해결한다.
FP8/FP4 퀀타이제화를 LLaMA 및 OPT 같은 대형 언어 모델과 데이터셋에 대해 INT8/INT4 기반과 비교한다.
W/A 정밀도 불일치를 완화하는 기법(스케일 제약, 비트 시프트)을 조사하고 모델 품질에 미치는 영향을 측정한다.
특히 작은 모델에서 양자화 오차를 줄이는 데 있어 LoRC의 효과를 평가한다.

제안 방법

FP8/FP4 설정에서 가중치 및 토큰별 활성화 양자화를 위한 GPTQ 기반 최적화를 채택한다.
LLaMA 및 OPT 모델 계열에서 FP8 활성화 대 INT8 활성화, FP4 대 INT4 가중치 양자화를 비교한다.
FP4-에서 FP8로의 캐스팅을 용이하게 하고 오버헤드를 줄이기 위해 가중치 양자화의 두 가지 스케일 제약(2의 거듭제곱 스케일)을 도입한다.
LoRC를 도입하여 특히 작은 모델에서 양자화 오차를 추가로 줄인다.
NVIDIA H100와 같은 하드웨어에서 W4A8 캐스팅 전략을 평가하고 FP4 가중치를 FP8 활성화와 정렬시키기 위한 비트 시프트 방법을 제안한다.
위키텍스트-2, PTB, C4 데이터셋과 모델 크기에 대한 차별적 결과를 제공한다.

Figure 1: Distribution of Activation values. The top, middle and bottom rows represents the distributions at the 2nd, 12th and final layer of the pretrained OPT-1.3b model. From the left to right columns, they are respectively for the linear modules attn.q_proj (same as attn.k_proj and attn.v_proj),

실험 결과

연구 질문

RQ1FP8 활성화가 LLM에서 INT8 활성화를 지속적으로 능가하는가, 특히 모델 크기가 1B 매개변수를 넘어갈 때 그렇다면 어떤가?
RQ2FP4 가중치 양자화가 성능에서 INT4에 대등하거나 이를 능가할 수 있는가, 그리고 LoRC가 이 비교에 어떤 영향을 미치는가?
RQ3제안된 두 가지 스케일 제약 방법(거듭제곱 스케일)이 FP4/FP8 W4A8 양자화에서 모델 품질을 보존하는가?
RQ4모델 규모에 걸친 W4A8 양자화에 LoRC를 적용하는 영향은 무엇인가?
RQ5FP8/FP4 양자화 체계가 표준 LLM 벤치마크 및 데이터셋에서 전통적인 INT 기반 양자화와 어떤 차이를 보이는가?

주요 결과

FP8 활성화는 일반적으로 LLaMA 및 OPT 모델 계열 모두에서 INT8 활성화를 능가하며, 대형 모델일수록 이득이 더 두드러진다.
FP8 가중치는 INT8과 경쟁하며, FP4 가중치는 중대형 모델에서 특히 INT4를 능가하는 경우가 많다.
LoRC는 W4A8 양자화를 개선하여 특히 작은 모델에서 양자화 오차를 줄인다.
가중치 스케일을 2의 거듭제곱으로 제한하는(M1 또는 M2) 경우 LoRC를 사용할 때 성능에 미치는 영향이 미미하며, 일반적으로 M2가 M1보다 더 나은 결과를 낸다.
LoRC와 함께 2의 거듭제곱 스케일 전략으로 FP4를 FP8로 캐스팅하는 방법은 하드웨어 친화적 구현을 가능하게 하면서 성능을 유지하는 데 도움이 된다.
데이터셋(WikiText-2, PTB, C4) 및 모델 크기(LLaMA-3b에서 30b, OPT-1.3b에서 30b)에 걸쳐 FP8/FP4 구성이 많은 구성에서 해당 INT 기반 구성보다 더 낮은 혼란도(perplexity)를 달성한다.

Figure 2: A Contrast between INT8 and FP8 Quantization Methods. The top row displays the original vector in its full-precision form. The subsequent row showcases the vector after quantization through the INT8 Asymmetric approach. The final two rows present values quantized by the FP8 method, utilizi

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.