QUICK REVIEW

[Paper Review] Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Jeonghoon Kim, Jung Hyun Lee | arXiv (Cornell University) | May 23, 2023

Topic Modeling | Cited by 28

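One-Line Summary

PEQA updates only the quantization scales of a quantized LLM and keeps the integer weights frozen. This enables memory-efficient fine-tuning and fast deployment at sub-4-bit precision while maintaining or improving performance.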

ABSTRACT

Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA) - a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Parallel to existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference on the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using an instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their full-precision original performances with PEQA.

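Motivation and Objectives

Reduce the memory and compute required to fine-tune large language models (LLMs).
Bridge parameter-efficient fine-tuning (PEFT) and quantized LLMs.
Propose and validate PEQA, a method that tunes only the quantization scales while keeping the integer weights frozen.
Demonstrate scalability up to 65B-parameter models and performance recovery under quantization.
Show the benefits in both task-specific adaptation and instruction tuning scenarios.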

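Proposed Method

Introduce PEQA: update only the per-channel quantization scales while keeping the quantized weight indices fixed.
Quantize the pre-trained weights with per-channel scales and zero-points, then freeze the integer quantization indices ${\bm{\bar{W}}}_{0}$.
Fine-tune the quantization scales ${\bm{s}}_{0}$ (through the update ${\bm{\nabla s}}$) on the downstream task with a standard fine-tuning objective.
Enable task switching by swapping in the task-specific scales ${\bm{s}}_{0}+\bm{\nabla s}$ while leaving the frozen integer weights unchanged.
Demonstrate that PEQA applies to weight-only quantized LLMs as well as to weight-and-activation quantization.
Compare against QAT and PEFT+PTQ on perplexity, showing competitive performance at sub-4-bit precision.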
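
The steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy squared-error loss, the 3-bit setting, and the two "tasks" are all assumptions made for illustration. It quantizes each output channel with its own scale and zero-point, freezes the resulting integers, and then runs gradient descent on the scale update alone.

```python
# Illustrative sketch only: names, toy loss, and data are assumptions, not the authors' code.

def quantize(weights, bits):
    """Per-channel (per-row) asymmetric round-to-nearest quantization."""
    qmax = 2 ** bits - 1
    q_rows, scales, zeros = [], [], []
    for row in weights:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / qmax or 1.0   # fall back to 1.0 for a constant row
        zero = round(-lo / scale)         # integer zero-point
        q_rows.append([min(max(round(w / scale) + zero, 0), qmax) for w in row])
        scales.append(scale)
        zeros.append(zero)
    return q_rows, scales, zeros


def dequantize(q_rows, scales, zeros):
    """Rebuild real-valued weights: w_hat = scale * (q - zero)."""
    return [[s * (q - z) for q in row] for row, s, z in zip(q_rows, scales, zeros)]


def loss(q_rows, scales, zeros, x, target):
    """Toy squared-error loss of the dequantized layer on a single input."""
    w_hat = dequantize(q_rows, scales, zeros)
    y = [sum(w * xj for w, xj in zip(row, x)) for row in w_hat]
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def tune_scales(q_rows, scales, zeros, x, target, lr=0.01, steps=50):
    """Gradient descent on the scale update only; integers and zero-points stay frozen."""
    delta = [0.0] * len(scales)
    for _ in range(steps):
        for i, (row, z) in enumerate(zip(q_rows, zeros)):
            a = sum((q - z) * xj for q, xj in zip(row, x))  # d y_i / d scale_i
            y_i = (scales[i] + delta[i]) * a
            delta[i] -= lr * (y_i - target[i]) * a
    return delta


W0 = [[0.10, -0.20, 0.30, 0.05], [-0.40, 0.25, 0.00, 0.15]]  # toy "pre-trained" weights
x = [1.0, 2.0, -1.0, 0.5]                                     # one toy input
q, s0, z0 = quantize(W0, bits=3)

# One small scale update per task; the integer matrix q is shared by every task.
task_targets = {"task_a": [-0.7, 0.3], "task_b": [-0.5, 0.2]}
task_deltas = {name: tune_scales(q, s0, z0, x, t) for name, t in task_targets.items()}

for name, delta in task_deltas.items():
    tuned = [s + d for s, d in zip(s0, delta)]
    before = loss(q, s0, z0, x, task_targets[name])
    after = loss(q, tuned, z0, x, task_targets[name])
    print(name, before, "->", after)
```

In this sketch, switching tasks means picking a different entry of `task_deltas`; the integer matrix `q` is never rewritten, so the quantized layout is unchanged after tuning.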

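Experimental Results

Research Questions

RQ1: Can updating only the quantization scales, with the integer weights frozen, match or exceed the performance of full fine-tuning and other PEFT + quantization methods?
RQ2: Does PEQA scale to large models (up to 65B parameters) while substantially reducing memory usage during training and deployment?
RQ3: Compared with full-precision baselines, does PEQA restore or improve the language modeling, few-shot in-context learning, and comprehension abilities of quantized LLMs?
RQ4: Is PEQA beneficial for both task-specific adaptation and instruction tuning across multiple datasets and model sizes?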

Key Findings

Method	Fine-Tuning DRAM	Deployment DRAM	Inference Speed	Task Switching
Full Fine-Tuning	457 GB	131 GB	Slow	Slow
PEFT	131 GB	131 GB	Slow	Fast
PEFT+PTQ	131 GB	33 GB	Fast	Slow
PTQ+PEFT	33 GB	33 GB	Slow	Fast
PEQA (Ours)	33 GB	33 GB	Fast	Fast
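
As a rough sanity check on the deployment column, the weight storage alone can be estimated from the parameter count and the bit width. The sketch below assumes the table refers to a 65B-parameter model stored in 16-bit floating point versus 4-bit integers; the table itself does not state this, and the helper `weight_memory_gb` is a hypothetical name. It ignores scales, zero-points, activations, and optimizer state, so it does not attempt to reproduce the 457 GB fine-tuning figure.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), counting the weights only."""
    return num_params * bits_per_weight / 8 / 1e9


params = 65e9  # assumption: the table describes a 65B-parameter model

print("16-bit weights:", weight_memory_gb(params, 16), "GB")
print(" 4-bit weights:", weight_memory_gb(params, 4), "GB")
```

Under these assumptions the estimates come out close to the 131 GB and 33 GB entries; the small gap is plausibly due to the exact parameter count and the quantization metadata, which this estimate leaves out.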

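PEQA achieves perplexity competitive with QAT and outperforms LoRA+OPTQ in the 3-bit and 4-bit settings.
PEQA substantially reduces DRAM usage during fine-tuning and deployment, making larger models feasible under memory constraints.
On Wikitext2 and PennTreeBank, PEQA's perplexity converges toward that of full-precision LoRA as model size grows, demonstrating its scalability.
Instruction tuning on Alpaca shows that PEQA can restore the few-shot in-context learning and comprehension performance of RTN-quantized LLMs.
PEQA reduces the number of trainable parameters and the model size, enabling DRAM-efficient fine-tuning of models up to 65B.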
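
To make the trainable-parameter point concrete, the sketch below counts trainable parameters for a single linear layer under three schemes. The layer size (8192 x 8192) and the LoRA rank (4) are hypothetical values chosen for illustration, not figures from the paper, and `trainable_params` is an assumed helper name. Per-channel scales contribute one trainable value per output channel, whereas LoRA adds two low-rank factors whose size depends on the rank.

```python
def trainable_params(out_features, in_features, lora_rank):
    """Trainable parameter counts for one linear layer under each scheme."""
    return {
        "full_fine_tuning": out_features * in_features,
        "lora": lora_rank * (out_features + in_features),
        "peqa_per_channel_scales": out_features,
    }


# Hypothetical layer shape and LoRA rank, for illustration only.
counts = trainable_params(out_features=8192, in_features=8192, lora_rank=4)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

How the LoRA and scale-only counts compare across a whole model depends on the rank and on which layers are adapted, so this is only a per-layer illustration.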

This review was generated by AI and checked by a human editor.