QUICK REVIEW

[Paper Review] Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM

Everlyn Asiko Chimoto, Mostafa Elhoushi|arXiv (Cornell University)|2026. 01. 26.


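One-Line Summary

This paper shows that non-English and multilingual calibration sets improve 4-bit post-training quantization of multilingual LLMs across both GPTQ and AWQ, with perplexity reductions of up to 3.52 points and better downstream performance.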

ABSTRACT

Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) on two quantizers (GPTQ, AWQ) on data from 10 languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Specifically, we observe notable average perplexity gains across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 points in perplexity. Furthermore, our analysis indicates that tailoring calibration sets to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static one-size-fits-all calibration is suboptimal and that tailoring calibration data, both in language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.

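Motivation and Objectives

Evaluate how the language composition of the calibration set affects post-training quantization of multilingual LLMs.
Compare English-only, non-English, and multilingual calibration sets across two quantizers (GPTQ and AWQ).
Analyze how the calibration data distribution and activation ranges affect quantization error and perplexity.
Provide practical guidelines for choosing calibration data suited to the quantizer and the target language.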

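Proposed Method

Evaluate eight calibration sets in total (five single-language sets and three multilingual mixes) on Llama3.1 8B and Qwen2.5 7B, quantized to 4 bits with GPTQ and AWQ.
Measure perplexity on Wikipedia and C4, and evaluate downstream tasks (XNLI, XStoryCloze, Global MMLU).
Analyze activation distributions and Hessian-based updates to explain the effect of calibration language.
Include Any4 results to check whether the benefit of language-diverse calibration generalizes across quantizers.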
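
The calibration settings above could be assembled along the following lines. This is a minimal sketch, not the authors' procedure: the review does not say how the multilingual mixes are composed, so the equal per-language share, the function name build_calibration_mix, and the language codes in the usage lines are all assumptions made for illustration.

```python
import random


def build_calibration_mix(pools, total_samples, seed=0):
    """Draw a calibration set with a near-equal share from each language.

    pools: dict mapping a language code to a list of candidate texts.
    Equal shares are an assumption; the paper may weight languages differently.
    """
    rng = random.Random(seed)
    languages = sorted(pools)
    base, extra = divmod(total_samples, len(languages))
    mix = []
    for i, lang in enumerate(languages):
        # The first `extra` languages absorb the remainder, one sample each.
        k = base + (1 if i < extra else 0)
        if k > len(pools[lang]):
            raise ValueError(f"not enough texts for language {lang!r}")
        mix.extend((lang, text) for text in rng.sample(pools[lang], k))
    rng.shuffle(mix)
    return mix


# Placeholder pools; real pools would hold text drawn from a corpus.
example_pools = {
    "en": [f"en-{i}" for i in range(5)],
    "sw": [f"sw-{i}" for i in range(5)],
    "zh": [f"zh-{i}" for i in range(5)],
}
example_mix = build_calibration_mix(example_pools, total_samples=7)
```

Passing a dictionary with a single language yields a single-language calibration set, so the same helper would cover both kinds of setting.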

Figure 1: Average perplexity on 10 languages for Llama3.1 8B. Multilingual calibration achieves the lowest perplexity (14.64), illustrating that calibration language affects quantization quality.
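
For readers less familiar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes the per-token losses have already been obtained from a model, and it assumes the figure's average is an unweighted mean over languages, which the review does not state. The function names are illustrative.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def average_perplexity(nlls_by_language):
    """Unweighted mean of per-language perplexities (an assumed averaging scheme)."""
    per_language = {
        lang: perplexity(nlls) for lang, nlls in nlls_by_language.items()
    }
    mean = sum(per_language.values()) / len(per_language)
    return mean, per_language
```

Under this definition, a drop of 3.52 points means the quantized model's perplexity is lower by 3.52 than under the baseline calibration, not that it improves by 3.52 percent.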

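Experimental Results

Research Questions

RQ1: How does the language composition of the calibration set affect quantization accuracy across languages?
RQ2: Do outlier tokens or extreme activations in the calibration data cause quantization error?
RQ3: How do different calibration sets interact with GPTQ's Hessian-based updates and AWQ's activation scaling?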

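Key Findings

Non-English and multilingual calibration sets generally outperform English-only calibration across languages.
Multilingual mixes achieve the largest gains, with a perplexity reduction of up to 3.52 points on Llama3.1 with GPTQ.
Aligning the calibration language with the evaluation language yields the largest improvement for individual languages, and AWQ can benefit from language-matched data.
AWQ is more robust thanks to its activation scaling, while GPTQ is more sensitive to the calibration language because of its Hessian-based updates.
Diverse calibration sets broaden the activation tails, which reduces quantization error and improves downstream performance.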
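
One way to build intuition for why calibration activations matter: both GPTQ and AWQ try to keep a layer's quantized output close to its full-precision output on the calibration inputs, so weight error in channels that those inputs rarely activate goes largely unobserved. The toy below is not GPTQ or AWQ. It uses plain round-to-nearest quantization of a single weight row, with made-up numbers, only to show that the same quantized weights can look error-free on one set of activations and not on another.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization with a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric quantization
    absmax = max(abs(w) for w in weights)
    if absmax == 0:
        return list(weights)
    scale = absmax / qmax
    return [round(w / scale) * scale for w in weights]


def output_error(weights, q_weights, activations):
    """Mean squared gap between full-precision and quantized outputs w . x."""
    total = 0.0
    for x in activations:
        diff = sum((w - q) * xi for w, q, xi in zip(weights, q_weights, x))
        total += diff * diff
    return total / len(activations)


# Illustrative values only, not taken from the paper.
weights = [7.0, 1.2, -3.4]
q_weights = quantize_rtn(weights)

narrow_inputs = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]  # only channel 0 is active
broad_inputs = [[0.0, 1.0, 1.0]]                    # other channels are active

error_narrow = output_error(weights, q_weights, narrow_inputs)
error_broad = output_error(weights, q_weights, broad_inputs)
```

In this toy, error_narrow is zero while error_broad is not, because the rounding error sits in channels that the narrow inputs never exercise. A calibration set whose activations cover more of the range seen at evaluation time gives the quantizer a better chance of noticing such error.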


This review was generated by AI and reviewed by a human editor.