QUICK REVIEW

[Paper Review] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Mengzhao Chen, Wenqi Shao|arXiv (Cornell University)|Jul 10, 2024

Topic Modeling | Citations: 6

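One-line summary

EfficientQAT introduces a two-phase quantization framework (Block-AP followed by E2E-QP) that compresses large language models efficiently, reaching strong 2-bit performance on a 70B model with only a small accuracy loss and reduced training memory usage.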

ABSTRACT

Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical due to substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points accuracy degradation compared to the full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.

Motivation and objectives

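Address the high memory usage and training cost of quantization-aware training (QAT) for large language models (LLMs).
Develop a two-phase, memory-efficient QAT method that starts from a good initialization and then trains only a minimal set of parameters end-to-end.
Demonstrate robustness and performance gains across base, instruction-tuned, and multimodal LLMs with 7B to 70B parameters.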

Proposed method

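Propose Block-AP (Block-wise Training of All Parameters), which trains each transformer block in turn with a block-wise reconstruction objective, enabling all-parameter training without retraining the whole LLM at once.
Introduce E2E-QP (End-to-End Training of Quantization Parameters), which freezes the quantized weights and optimizes only the quantization parameters (step sizes and, optionally, zero points) end-to-end.
Quantize weights with standard uniform quantization, and make the learned s (scale, i.e. step size) and z (zero point) part of the computational graph so they can be optimized with gradients.
Share s and z within each quantization group to reduce memory and the number of trainable parameters.
Block-AP provides a robust initialization; combined with E2E-QP, it improves quantization performance over previous methods.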
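
The group-wise uniform quantization described in the list above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names (quantize_group, dequantize_group, quantize_groupwise), the min-max initialization of s and z, and the toy weights are all assumptions made for the example.

```python
def quantize_group(weights, n_bits):
    """Asymmetric uniform quantization of one group.

    s and z are initialised from the group's min/max (an assumption of
    this sketch); every weight in the group shares the same s and z.
    """
    q_max = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    s = (w_max - w_min) / q_max
    if s == 0.0:
        s = 1.0  # constant group: any positive step size works
    z = round(-w_min / s)  # left unclamped in this sketch
    # Map each weight to an integer code in [0, 2^n_bits - 1].
    w_int = [min(max(round(w / s) + z, 0), q_max) for w in weights]
    return w_int, s, z


def dequantize_group(w_int, s, z):
    """Recover approximate weights: w_hat = (w_int - z) * s."""
    return [(q - z) * s for q in w_int]


def quantize_groupwise(weights, n_bits, group_size):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], n_bits)
        for start in range(0, len(weights), group_size)
    ]


# Toy weights chosen so that each group is exactly representable in 2 bits.
weights = [0.0, 1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0]
groups = quantize_groupwise(weights, n_bits=2, group_size=4)
restored = [w for w_int, s, z in groups for w in dequantize_group(w_int, s, z)]
```

With eight weights and a group size of four, only two (s, z) pairs are stored alongside the 2-bit integer codes, which is the saving that sharing within a group is meant to provide.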

Experimental results

Research questions

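RQ1: Can Block-AP provide a memory-efficient and effective initialization for the subsequent quantization-aware training of LLMs?
RQ2: Does applying E2E-QP (end-to-end training of s and z) on top of the Block-AP initialization deliver advantages in accuracy and efficiency at 2-bit, 3-bit, and 4-bit quantization?
RQ3: How does EfficientQAT compare with PTQ, QAT, and Q-PEFT baselines on base, instruction-tuned, and multimodal LLMs (7B–70B)?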

Key findings

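EfficientQAT remains strong at low bit-widths: 2-bit Llama-2-70B loses less than 3 points of accuracy relative to full precision (72.41 vs. 69.48).
Block-AP provides a robust initialization; combined with E2E-QP, it improves quantization results over existing methods.
E2E-QP trains only the quantization parameters, which greatly reduces memory usage during training (e.g., 34.2 GB for the 2-bit 70B model, which fits on a single A100-80GB GPU).
Across 2 to 4 bits, EfficientQAT outperforms established QAT and Q-PEFT baselines on base, instruction-tuned, and multimodal LLMs.
Uniform quantization also improves inference speed: the forward pass through the linear layers of large models is up to 4.4x faster.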
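
To make the E2E-QP finding concrete, the sketch below trains only a step size s while the integer weights and zero point stay frozen. It is a toy illustration under stated assumptions: the function name train_step_size, the mean-squared reconstruction loss against a fixed target, and the learning rate are all hypothetical, whereas the method summarized above optimizes the quantization parameters end-to-end through the whole model.

```python
def train_step_size(w_int, z, s_init, target, lr=0.05, steps=100):
    """Gradient descent on the step size s only.

    w_int (integer codes) and z (zero point) are never updated, so the
    only trainable value is a single float per quantization group.
    """
    s = s_init
    n = len(w_int)
    for _ in range(steps):
        # Dequantized weight is (q - z) * s, so d(w_hat)/ds = (q - z).
        # Gradient of the mean squared error between w_hat and target:
        grad = sum(
            2.0 * ((q - z) * s - t) * (q - z)
            for q, t in zip(w_int, target)
        ) / n
        s -= lr * grad
    return s


# Frozen 2-bit codes; the toy target is reproduced exactly when s = 0.5.
w_int = [0, 1, 2, 3]
target = [0.0, 0.5, 1.0, 1.5]
s_trained = train_step_size(w_int, z=0, s_init=1.0, target=target)
```

Because gradients and optimizer state are kept only for s (and optionally z), and not for the weights themselves, the trainable footprint is a small fraction of the model, which is consistent with the reduced training memory reported above.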

This review was generated by AI and checked by a human editor.