QUICK REVIEW

[Paper Review] SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao|arXiv (Cornell University)|May 26, 2024

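Advanced Data Compression Techniques | Citations: 6

Summary in brief

SpinQuant learns rotation matrices to improve post-training quantization of LLM weights, activations, and the KV cache, narrowing the gap between 4-bit quantization and full precision on LLaMA-2/3 models (to 2.9 points on zero-shot reasoning for LLaMA-2 7B). It uses Cayley SGD to optimize the rotations on the Stiefel manifold without changing the full-precision output.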

ABSTRACT

Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
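
The abstract's claim that rotations "lead to identical outputs in full-precision" can be checked on a single linear layer. The following is a minimal sketch, not the authors' code: the layer sizes, the `random_orthogonal` helper, and the `y = x @ w` convention are assumptions made for illustration, and normalization layers are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n, rng):
    """Hypothetical helper: random orthogonal matrix from a QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

d_in, d_out = 8, 4                       # toy sizes (assumption)
x = rng.standard_normal((3, d_in))       # toy activations, batch of 3
w = rng.standard_normal((d_in, d_out))   # toy weight, layer computes y = x @ w
rot = random_orthogonal(d_in, rng)

y_ref = x @ w

# Rotate the activations and absorb the inverse rotation into the weight.
x_rot = x @ rot
w_rot = rot.T @ w
y_rot = x_rot @ w_rot

# Largest absolute difference between the two outputs (floating-point noise only).
max_diff = float(np.max(np.abs(y_ref - y_rot)))
print(max_diff)
```

Because `rot @ rot.T` is the identity, the rotation cancels in full precision; only the quantized versions of `x_rot` and `w_rot` differ from their unrotated counterparts.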

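Motivation and objectives

Motivate and address the difficulty of quantizing LLMs that is caused by outliers in weights and activations.
Introduce rotation-based invariant parameterizations that preserve the full-precision output while enabling better quantization.
Propose learning the rotations with Cayley SGD on the Stiefel manifold to minimize the quantization loss on a calibration set.
Demonstrate improvements over existing PTQ methods across several LLaMA-2/3 model sizes and tasks.
Show compatibility with GPTQ, and robustness through ablations of the rotation components.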

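Proposed method

Parameterize rotations at several points in the Transformer to reduce outliers without adding new parameters.
Where possible, absorb R1 and R2 into the weights so that the full-precision output stays identical.
Where absorption is not possible, apply online Hadamard rotations R3 and R4 for the KV cache and for specific blocks.
Optimize R1 and R2 on the Stiefel manifold with Cayley SGD to minimize the quantization loss on a small calibration set.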
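
The idea of optimizing a rotation while keeping it orthogonal can be sketched with a plain Cayley-transform descent step. This is an illustrative sketch under stated assumptions, not the paper's optimizer: it has no momentum, it normalizes the step direction for stability, and `outlier_loss` is a stand-in fourth-moment penalty rather than the quantized network's loss on a calibration set. All function names here are hypothetical.

```python
import numpy as np

def cayley_step(rot, grad, lr):
    """One descent step that keeps `rot` orthogonal via the Cayley transform."""
    n = rot.shape[0]
    # Skew-symmetric direction built from the Euclidean gradient.
    skew = grad @ rot.T - rot @ grad.T
    norm = np.linalg.norm(skew)
    if norm == 0.0:
        return rot
    skew = skew / norm  # normalized direction: an assumption, not from the paper
    eye = np.eye(n)
    # Solve (I + lr/2 * A) R_new = (I - lr/2 * A) R for R_new.
    return np.linalg.solve(eye + 0.5 * lr * skew, (eye - 0.5 * lr * skew) @ rot)

def outlier_loss(x, rot):
    """Stand-in objective: mean fourth power of the rotated activations."""
    return float(np.mean((x @ rot) ** 4))

def outlier_grad(x, rot):
    """Gradient of `outlier_loss` with respect to `rot`."""
    z = x @ rot
    return 4.0 * x.T @ z**3 / z.size

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))   # toy calibration activations (assumption)
x[:, 0] *= 10.0                    # one channel carries large outliers

rot = np.eye(8)
loss_before = outlier_loss(x, rot)
for _ in range(50):
    rot = cayley_step(rot, outlier_grad(x, rot), lr=0.1)
loss_after = outlier_loss(x, rot)

# Deviation of rot from exact orthogonality after all steps.
orth_error = float(np.max(np.abs(rot.T @ rot - np.eye(8))))
print(loss_before, loss_after, orth_error)
```

The Cayley transform of a skew-symmetric matrix is itself orthogonal, so every iterate stays on the manifold without a separate re-orthogonalization step. For square matrices the Stiefel manifold coincides with the orthogonal group, which is the case shown here.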

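Experimental results

Research questions

RQ1: Can rotations learned on the residual, attention, and KV-cache paths suppress outliers enough to improve 4-bit quantization of LLMs?
RQ2: Does optimizing rotations with Cayley SGD on the Stiefel manifold yield consistent gains over random rotations and Hadamard rotations?
RQ3: How well does SpinQuant perform against state-of-the-art PTQ methods (e.g., GPTQ, SmoothQuant, QuaRot) on LLaMA-2/3 models in 4-bit settings?
RQ4: Are rotation-based methods compatible with existing quantization pipelines, and do they leave the full-precision network output unchanged?
RQ5: How much does each rotation component (R1–R4) contribute to quantization performance?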

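Key findings

Optimizing the rotation matrices with Cayley SGD yields much better quantization performance than random rotations across multiple models and tasks.
With 4-bit W-A-KV quantization, SpinQuant narrows the gap to full precision to a few points (e.g., 2.9 points on LLaMA-2 7B) and outperforms QuaRot and SmoothQuant on zero-shot tasks.
Rotation-based quantization spreads outliers more evenly, which improves both activation and weight quantization and lowers quantization error.
SpinQuant delivers strong improvements on hard-to-quantize models (LLaMA-3 8B/70B) while remaining compatible with GPTQ.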
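Ablations show that adding more rotations (R1–R4) generally improves accuracy, with the online rotations contributing notable gains: R3 on the KV-cache path and R4 inside the feed-forward block.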
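
The finding that rotations spread outliers and lower quantization error can be reproduced on toy data. The sketch below is an assumption-laden illustration rather than the paper's evaluation: it uses a fixed Hadamard rotation instead of a learned one, synthetic Gaussian activations with one inflated channel, and a simple symmetric per-row 4-bit quantizer (`fake_quant_4bit` is a hypothetical helper).

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix (Sylvester construction); n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def fake_quant_4bit(x):
    """Symmetric per-row 4-bit quantize, then dequantize back to floats."""
    scale = np.max(np.abs(x), axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))  # toy activations: 256 tokens, 64 channels
x[:, 0] *= 50.0                     # a single outlier channel

h = hadamard(64)

# Quantize directly: the outlier channel dictates each row's scale.
err_plain = float(np.mean((fake_quant_4bit(x) - x) ** 2))

# Rotate, quantize, then rotate back before measuring the error.
x_rot = x @ h
err_rot = float(np.mean((fake_quant_4bit(x_rot) @ h.T - x) ** 2))

print(err_plain, err_rot)
```

Without rotation, the outlier channel inflates the per-row scale so most other channels round to zero; after rotation the outlier's energy is shared across all 64 channels and the quantization grid is used more evenly.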

This review was generated by AI and checked by a human editor.