QUICK REVIEW

[Paper Review] SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper|arXiv (Cornell University)|Jun 13, 2023

Topic Modeling | Citations: 23

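One-Line Summary

SqueezeLLM introduces sensitivity-based non-uniform quantization and a Dense-and-Sparse decomposition to enable ultra-low-bit (down to 3-bit) post-training quantization of LLMs with minimal performance degradation and a notable speedup, addressing the memory-bandwidth bottleneck in single-batch generative inference.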

ABSTRACT

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

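Motivation and Objectives

Identify memory bandwidth as the main bottleneck in single-batch LLM inference and quantify its implications for quantization strategy.
Develop a post-training quantization framework that reaches ultra-low precision with almost no degradation in generation quality.
Propose a sensitivity-based non-uniform quantization method that places quantization bins around sensitive weight values.
Introduce the Dense-and-Sparse decomposition to store outliers and sensitive weights separately in an efficient sparse representation.
Demonstrate improvements in perplexity, model size, and latency across multiple LLaMA-family models and benchmarks.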
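The memory-bandwidth argument in the first objective can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes that generating one token at batch size 1 requires reading every weight once, so the weight footprint divided by memory bandwidth gives a lower bound on per-token latency. The 7B parameter count and the 768 GB/s bandwidth figure are illustrative assumptions, not values taken from this review, and the bound ignores activations, the KV cache, lookup-table overhead, and compute time.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights in bytes at a given precision."""
    return num_params * bits_per_weight / 8


def min_token_latency_ms(num_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Idealized lower bound: time to stream all weights from memory once."""
    return weight_bytes(num_params, bits_per_weight) / (bandwidth_gb_s * 1e9) * 1e3


PARAMS = 7e9            # assumption: a 7B-parameter model
BANDWIDTH_GB_S = 768.0  # assumption: nominal GPU memory bandwidth in GB/s

for bits in (16, 4, 3):
    size_gb = weight_bytes(PARAMS, bits) / 1e9
    bound = min_token_latency_ms(PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit weights: {size_gb:.2f} GB, at least {bound:.2f} ms/token")
```

Under this idealized model the bound scales linearly with bits per weight, which is why weight-only quantization helps even when arithmetic stays in FP16. Measured speedups are smaller than the ideal ratio because real kernels pay for dequantization and other overheads.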

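Proposed Method

Sensitivity-based non-uniform quantization uses a weighted k-means objective, guided by second-order (Fisher) information, that places the quantization centroids close to the sensitive weight values.
The Hessian is approximated by the diagonal of the Fisher information matrix, which weights each weight perturbation in the quantization objective.
The Dense-and-Sparse decomposition W = D + S separates the dense weights from the outliers; S is stored in a sparse format and D is quantized over a reduced range.
LLaMA, LLaMA-2, OPT, and Vicuna models are evaluated on C4, WikiText2, MMLU, and the Vicuna benchmark, and compared against RTN, GPTQ, AWQ, and SpQR.
GPU-oriented Dense-and-Sparse kernels are implemented with LUT-based 3/4-bit quantization and FP16 arithmetic, using CSR-based sparse products to handle the outliers.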
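The steps above can be sketched for a single weight row in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights and gradients are synthetic, the diagonal Fisher value is approximated by one squared gradient per weight, outliers are picked purely by magnitude, and the sparse part is a dictionary rather than CSR. All function names are hypothetical.

```python
import random


def split_dense_sparse(weights, outlier_fraction):
    """Return (dense, sparse) with weights[i] == dense[i] + sparse.get(i, 0.0)."""
    n_out = max(1, round(len(weights) * outlier_fraction))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    sparse = {i: weights[i] for i in by_magnitude[:n_out]}  # kept in full precision
    dense = [0.0 if i in sparse else w for i, w in enumerate(weights)]
    return dense, sparse


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda j: (v - centroids[j]) ** 2)


def weighted_kmeans_1d(values, sens, k, iters=25):
    """Lloyd iterations that reduce sum_i sens[i] * (values[i] - lut[codes[i]]) ** 2."""
    lo, hi = min(values), max(values)
    lut = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]  # evenly spaced start
    for _ in range(iters):
        codes = [nearest(lut, v) for v in values]
        for j in range(k):
            mass = sum(s for s, c in zip(sens, codes) if c == j)
            if mass > 0:  # clusters with no sensitivity mass keep their centroid
                lut[j] = sum(s * v for s, v, c in zip(sens, values, codes) if c == j) / mass
    return lut, [nearest(lut, v) for v in values]


def dense_sparse_dot(lut, codes, sparse, x):
    """Dot product with the reconstructed row: LUT lookup for D, sparse pass for S."""
    dense_part = sum(lut[c] * xi for c, xi in zip(codes, x))
    sparse_part = sum(v * x[i] for i, v in sparse.items())
    return dense_part + sparse_part


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(2000)]  # one synthetic weight row
weights[10], weights[500] = 0.9, -0.7                     # two synthetic outliers
grads = [random.gauss(0.0, 1.0) for _ in weights]         # stand-in for calibration gradients
fisher = [g * g for g in grads]                           # diagonal Fisher ~ squared gradient

dense, sparse = split_dense_sparse(weights, outlier_fraction=0.001)
sens = [0.0 if i in sparse else f for i, f in enumerate(fisher)]  # ignore removed slots
lut, codes = weighted_kmeans_1d(dense, sens, k=8)                 # 3 bits -> 8 LUT entries
recon = [lut[c] + sparse.get(i, 0.0) for i, c in enumerate(codes)]

err = sum(f * (w - r) ** 2 for f, w, r in zip(fisher, weights, recon))
print(f"sensitivity-weighted reconstruction error: {err:.6f}")
```

Removing the outliers first shrinks the range the eight centroids must cover, and the sensitivity weights pull centroids toward the values whose perturbation is assumed to matter most. In this sketch the zeroed outlier slots in D still map to the centroid nearest zero, so a small residual error remains at those positions.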

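Experimental Results

Research Questions

RQ1: How far can LLM weights be quantized to ultra-low bit precision (e.g., 3-bit) without hurting end-to-end performance on generative tasks?
RQ2: Does placing quantization centroids near the weights that most strongly affect the final loss, as measured by Fisher information, improve end-to-end quantization performance over uniform or naive non-uniform methods?
RQ3: Does the Dense-and-Sparse decomposition effectively separate outliers and highly sensitive weights, enabling smaller models and faster inference?
RQ4: How do the practical latency and memory-bandwidth benefits on real hardware (e.g., A6000) compare with existing PTQ methods?
RQ5: Does the proposed method generalize to instruction-following and domain-knowledge benchmarks (e.g., MMLU, Vicuna) and to larger model families?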

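Key Findings

On LLaMA-7B, 3-bit SqueezeLLM narrows the perplexity gap to the FP16 baseline by up to 2.1x compared with state-of-the-art methods under the same memory budget.
The Dense-and-Sparse decomposition extracts about 0.45% of the weights as sparse/outlier values and yields further gains; for example, LLaMA-7B perplexity on C4 improves from 7.75 to 7.58.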
On an A6000 GPU, token generation is up to 2.3x faster than the FP16 baseline, with memory usage competitive with or better than grouped GPTQ/AWQ setups.
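On zero-shot MMLU with instruction-following models (Vicuna), SqueezeLLM outperforms AWQ at 3 bits and preserves FP16 accuracy with 4-bit quantization; the 5-shot results are consistent with this improved robustness.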
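Across LLaMA and its larger variants (13B, 30B, 65B), SqueezeLLM consistently achieves better perplexity than GPTQ and AWQ at comparable model size and bit width.
Even in the dense-only setting (0% sparsity), performance is close to FP16 at 4-bit and improves substantially at 3-bit, underscoring the benefit of weight quantization for memory-bound inference.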
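To give a feel for what keeping roughly 0.45% of the weights in sparse form costs in storage, the following sketch estimates the average bits per weight of a dense-plus-CSR layout. It is a rough estimate under assumptions that do not come from this review: FP16 sparse values, 32-bit column indices and row pointers, a hypothetical 4096 x 4096 layer, and no accounting for the lookup tables themselves.

```python
def avg_bits_per_weight(rows: int, cols: int, dense_bits: float,
                        sparse_fraction: float,
                        value_bits: int = 16, index_bits: int = 32) -> float:
    """Average storage per weight for a quantized dense matrix plus a CSR sparse matrix."""
    n = rows * cols
    nnz = round(n * sparse_fraction)           # entries moved to the sparse matrix
    dense_total = n * dense_bits               # every slot still holds a low-bit code
    csr_total = nnz * (value_bits + index_bits) + (rows + 1) * index_bits
    return (dense_total + csr_total) / n


# Hypothetical layer shape; only the 0.45% fraction is taken from the findings above.
for fraction in (0.0, 0.0045):
    bits = avg_bits_per_weight(4096, 4096, dense_bits=3, sparse_fraction=fraction)
    print(f"sparse fraction {fraction:.2%}: about {bits:.3f} bits per weight")
```

Each sparse entry is expensive on its own because it carries an index alongside its value, but a fraction this small adds only a few tenths of a bit per weight on average under these assumptions.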

This review was generated by AI and checked by a human editor.