QUICK REVIEW

[Paper Review] Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM

Everlyn Asiko Chimoto, Mostafa Elhoushi|arXiv (Cornell University)|Jan 26, 2026

Topic Modeling | Citations: 0

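One-Sentence Summary

The paper shows that non-English and multilingual calibration sets improve 4-bit post-training quantization of multilingual LLMs under both GPTQ and AWQ, lowering perplexity by up to 3.52 points and improving downstream performance.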

ABSTRACT

Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) on two quantizers (GPTQ, AWQ) on data from 10 languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Specifically, we observe notable average perplexity gains across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 points in perplexity. Furthermore, our analysis indicates that tailoring calibration sets to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static one-size-fits-all calibration is suboptimal and that tailoring calibration data, both in language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.

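Motivation and Objectives

Evaluate how the language composition of the calibration set affects post-training quantization of multilingual LLMs.
Compare English-only, non-English, and multilingual calibration sets on two quantizers, GPTQ and AWQ.
Analyze how the calibration data distribution and activation ranges affect quantization error and perplexity.
Provide practical guidelines for choosing calibration data to match the quantizer and the target language.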

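Proposed Method

Evaluate eight calibration sets (five single-language, three multilingual mixes) with 4-bit GPTQ and AWQ on Llama3.1 8B and Qwen2.5 7B.
Measure perplexity on Wikipedia and C4, and evaluate downstream tasks (XNLI, XStoryCloze, Global MMLU).
Analyze activation distributions and Hessian-based updates to explain the effect of calibration language.
Include Any4 results to check how general the benefit of language-diverse calibration is.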
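As an illustration of what a multilingual calibration mix might look like in practice, the sketch below draws an evenly balanced sample of texts across languages. The function name `build_calibration_mix`, the language codes, the toy passages, and the even-split policy are all assumptions made for this example; the review does not state how the paper's three mixes are composed or how many samples they contain.

```python
import random
from collections import Counter


def build_calibration_mix(corpora, n_samples, seed=0):
    """Draw n_samples (language, text) pairs, split as evenly as possible.

    corpora: dict mapping a language code to a list of candidate texts.
    Hypothetical helper for illustration, not the paper's procedure.
    """
    if not corpora:
        raise ValueError("corpora must contain at least one language")
    rng = random.Random(seed)
    langs = sorted(corpora)  # fixed order keeps the split reproducible
    base, extra = divmod(n_samples, len(langs))
    mix = []
    for i, lang in enumerate(langs):
        # The first `extra` languages receive one additional sample.
        k = base + (1 if i < extra else 0)
        if k > len(corpora[lang]):
            raise ValueError(f"not enough texts for {lang!r}: need {k}")
        mix.extend((lang, text) for text in rng.sample(corpora[lang], k))
    rng.shuffle(mix)  # avoid grouping samples by language
    return mix


# Toy corpora with placeholder passages (assumed, not from the paper).
toy_corpora = {
    lang: [f"{lang} passage {i}" for i in range(10)]
    for lang in ("en", "sw", "zh")
}
mix = build_calibration_mix(toy_corpora, n_samples=8)
print(Counter(lang for lang, _ in mix))
```

A single-language calibration set is the special case where `corpora` has one key, so the same helper could cover both kinds of setting under these assumptions.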

Figure 1: Average perplexity on 10 languages for Llama3.1 8B. Multilingual calibration achieves the lowest perplexity (14.64), illustrating that calibration language affects quantization quality.
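For readers who want to relate the perplexity numbers to model outputs, here is a minimal sketch of the standard definition: perplexity is the exponential of the mean negative log-likelihood per token. The unweighted mean across languages in `average_perplexity` is an assumption; the review does not say how the paper aggregates its 10 languages, and the function names and toy log-probabilities are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def average_perplexity(logprobs_by_language):
    """Unweighted mean of per-language perplexities (assumed aggregation)."""
    per_lang = {
        lang: perplexity(lps) for lang, lps in logprobs_by_language.items()
    }
    return sum(per_lang.values()) / len(per_lang), per_lang


# Toy inputs: a model that assigns probability 0.25 (or 0.5) to every token.
toy = {
    "lang_a": [math.log(0.25)] * 4,
    "lang_b": [math.log(0.5)] * 6,
}
avg, per_lang = average_perplexity(toy)
print(per_lang, avg)
```

Because perplexity is exponential in the loss, a gap of a few points between calibration settings corresponds to a modest but consistent difference in average per-token log-likelihood.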

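Experimental Results

Research Questions

RQ1: How does the language composition of the calibration set affect quantization accuracy across languages?
RQ2: Do outlier tokens or extreme activations in the calibration data induce quantization error?
RQ3: How do different calibration sets interact with GPTQ's Hessian-based updates and AWQ's activation scaling?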
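For context on RQ3, the two mechanisms can be written down as they are formulated in the original GPTQ and AWQ papers. These formulas are not restated in this review and are given here only as background; W denotes a layer's weight matrix, X the layer inputs collected from the calibration set, and Q(·) the quantization function.

```latex
% GPTQ: layer-wise reconstruction objective and the Hessian it relies on
\hat{W} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2,
\qquad H = 2\, X X^{\top}

% AWQ: search for per-input-channel scales s
s^{*} = \arg\min_{s}
  \lVert Q\big(W \,\mathrm{diag}(s)\big)\, \mathrm{diag}(s)^{-1} X - W X \rVert
```

In both cases the calibration data enters only through X, which is why a change in calibration language can change the quantized weights at all.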

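Key Findings

Non-English and multilingual calibration sets generally outperform English-only calibration across languages.
Multilingual mixes yield the largest gains, with a perplexity reduction of up to 3.52 points for Llama3.1 with GPTQ.
Calibration aligned with the evaluation language yields the largest improvement for individual languages; AWQ may benefit from language-matched data.
AWQ is robust thanks to its activation scaling, whereas GPTQ is sensitive to calibration language because of its Hessian-based updates.
Diverse calibration sets broaden the tails of the activation distribution seen during calibration, reducing quantization error and improving downstream performance.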
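The last finding can be illustrated with a toy example of symmetric uniform 4-bit quantization whose clipping range is taken from calibration values. This is not GPTQ or AWQ, and all numbers are made up; it only sketches, under these simplifying assumptions, why a calibration set that never sees large values can lead to heavy clipping error on data that contains them.

```python
def quantize_dequantize(values, max_abs, bits=4):
    """Symmetric uniform quantization with a clipping range of +/- max_abs."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax - 1, min(qmax, q))  # clip to the signed integer grid
        out.append(q * scale)
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def calibrated_error(calibration, evaluation, bits=4):
    """Quantization error on `evaluation` with the range set by `calibration`."""
    max_abs = max(abs(v) for v in calibration)
    restored = quantize_dequantize(evaluation, max_abs, bits)
    return mean_squared_error(evaluation, restored)


# Made-up values: the evaluation data has a heavier tail than the
# narrow calibration set, but not heavier than the diverse one.
evaluation = [0.5, -3.0, 6.0]
narrow_calibration = [-1.0, 0.5, 1.0]
diverse_calibration = [-6.0, 1.0, 6.0]

print("narrow :", calibrated_error(narrow_calibration, evaluation))
print("diverse:", calibrated_error(diverse_calibration, evaluation))
```

The toy also hints at the trade-off: a wider range means a coarser step for small values, so broader coverage helps only when the evaluation data actually reaches into the tails.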


This review was generated by AI and checked by a human editor.