QUICK REVIEW

[Paper Review] LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Michael Lewis|arXiv (Cornell University)|Aug 15, 2022

Ferroelectric and Negative Capacitance Devices | Cited by 112

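One-Line Summary

Introduces LLM.int8(), a two-part quantization procedure that combines vector-wise 8-bit quantization with mixed-precision decomposition. It enables 8-bit inference for transformers with up to 175B parameters without accuracy degradation, making large models accessible on consumer GPUs.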

ABSTRACT

Large language models have been widely adopted but require significant GPU memory for inference. We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cut the memory needed for inference by half while retaining full precision performance. With our method, a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately without performance degradation. This is made possible by understanding and working around properties of highly systematic emergent features in transformer language models that dominate attention and transformer predictive performance. To cope with these features, we develop a two-part quantization procedure, LLM.int8(). We first use vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, to quantize most of the features. However, for the emergent outliers, we also include a new mixed-precision decomposition scheme, which isolates the outlier feature dimensions into a 16-bit matrix multiplication while still more than 99.9% of values are multiplied in 8-bit. Using LLM.int8(), we show empirically it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation. This result makes such models much more accessible, for example making it possible to use OPT-175B/BLOOM on a single server with consumer GPUs. We open-source our software.

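Motivation and Objectives

Motivate memory reduction for large-transformer inference and quantify its effect on performance.
Develop an 8-bit quantization method that retains full-precision performance for multi-billion-parameter models.
Identify and quantify the large-magnitude emergent outlier features that hinder quantization.
Propose a mixed-precision decomposition that preserves the outliers while retaining 8-bit efficiency.
Provide open-source tools and integrations to encourage adoption.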

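Proposed Method

Proposes vector-wise 8-bit quantization with a separate normalization constant for each inner product, which improves precision.
Introduces mixed-precision decomposition, which isolates the rare outlier feature dimensions into a 16-bit path while more than 99.9% of values stay in 8-bit.
Decomposes the matrix multiplication into an outlier part (16-bit) and a regular part (8-bit), and denormalizes the 8-bit result by the outer product of the normalization constants.
Analyzes the large-magnitude outlier features that emerge as model size grows and their effect on quantization.
Evaluates perplexity (C4) on models from 125M to 13B parameters and verifies zero-shot accuracy on OPT models up to 175B.
Open-sources bitsandbytes with Hugging Face Transformers integration.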
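
The two-part procedure in the list above can be sketched in NumPy. This is an illustrative emulation under stated assumptions, not the bitsandbytes implementation: the function names `absmax_quantize` and `llm_int8_matmul` are hypothetical, the outlier threshold of 6.0 is an assumed value that this review does not state, and the int8 product is emulated with int32 arithmetic on the CPU rather than run on GPU int8 kernels.

```python
import numpy as np

def absmax_quantize(x, axis):
    """Symmetric int8 quantization with one scale per vector along `axis`."""
    # initial=0.0 keeps the reduction defined even if a slice is empty.
    scale = np.max(np.abs(x), axis=axis, keepdims=True, initial=0.0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # all-zero vectors: avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def llm_int8_matmul(X, W, threshold=6.0):
    """Approximate X @ W with outlier feature dimensions in float16 and the rest in int8.

    X: (tokens, hidden) activations, W: (hidden, out) weights.
    `threshold` is an assumed magnitude cutoff for calling a value an outlier.
    """
    # Feature dimensions (columns of X) that contain at least one large value.
    outlier_dims = np.any(np.abs(X) >= threshold, axis=0)
    regular_dims = ~outlier_dims

    # 16-bit path: outlier columns of X times the matching rows of W.
    out_fp16 = (X[:, outlier_dims].astype(np.float16)
                @ W[outlier_dims, :].astype(np.float16)).astype(np.float32)

    # 8-bit path: one scale per row of X and one per column of W (vector-wise).
    Xq, sx = absmax_quantize(X[:, regular_dims], axis=1)  # sx has shape (tokens, 1)
    Wq, sw = absmax_quantize(W[regular_dims, :], axis=0)  # sw has shape (1, out)
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)       # integer accumulation
    out_int8 = acc.astype(np.float32) * (sx * sw)         # outer product of scales denormalizes

    return out_fp16 + out_int8

# Synthetic demo: random activations with one injected outlier feature dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)
W = rng.normal(size=(64, 8)).astype(np.float32)
X[:, 3] = 40.0 * np.sign(X[:, 3])  # column 3 becomes a large-magnitude feature

exact = X @ W
approx = llm_int8_matmul(X, W)
no_decomp = llm_int8_matmul(X, W, threshold=np.inf)  # everything forced through int8

err_decomp = float(np.max(np.abs(approx - exact)))
err_plain = float(np.max(np.abs(no_decomp - exact)))
print("max abs error with decomposition:   ", err_decomp)
print("max abs error without decomposition:", err_plain)
```

Without the decomposition, the single large column inflates every row scale of X, so the ordinary values lose most of their 8-bit resolution. Routing that column through the 16-bit path keeps the int8 scales small, which is the intuition behind the method.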

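Experimental Results

Research Questions

RQ1: To what extent can 8-bit quantization preserve transformer inference performance?
RQ2: What are the outlier features that emerge at scale, and how do they affect quantization precision?
RQ3: Is a mixed-precision decomposition viable that keeps 16-bit precision for the outliers while quantizing everything else to 8-bit?
RQ4: As model size grows, does vector-wise quantization outperform other 8-bit quantization schemes?
RQ5: What are the practical implications of deploying 175B+ models on consumer GPUs?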

Key Findings

Model size (parameters)	32-bit Float	Int8 absmax	Int8 zeropoint	Int8 absmax row-wise	Int8 absmax vector-wise	Int8 zeropoint vector-wise	Int8 absmax row-wise + decomposition	LLM.int8() (vector-wise + decomp)	Zeropoint LLM.int8() (vector-wise + decomp)
125M	25.65	87.76	56.66	30.93	35.84	25.72	30.76	25.83	25.69
1.3B	15.91	16.55	16.24	17.08	16.82	15.94	16.19	15.93	15.92
2.7B	14.43	15.11	14.76	15.24	14.98	14.36	14.65	14.44	14.43
6.7B	13.24	14.59	13.49	14.13	14.13	13.38	13.25	13.24	13.24
13B	12.45	19.08	13.94	16.49	16.48	13.47	12.46	12.45	12.45
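
The table columns contrast absmax (symmetric) and zeropoint (asymmetric) quantization. As a rough illustration of the difference, the sketch below quantizes a tensor to int8 and back with each scheme and compares the round-trip error. The function names are hypothetical, the formulas are the textbook versions rather than the paper's exact definitions, and the input is assumed to be non-constant so the scales are finite.

```python
import numpy as np

def absmax_roundtrip(x):
    """Symmetric int8: scale so the largest magnitude maps to 127, then dequantize."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.clip(np.round(x * scale), -127, 127)
    return q / scale

def zeropoint_roundtrip(x):
    """Asymmetric int8: shift and scale [min, max] onto [-128, 127], then dequantize."""
    scale = 255.0 / (np.max(x) - np.min(x))
    zero_point = np.round(-128.0 - np.min(x) * scale)
    q = np.clip(np.round(x * scale) + zero_point, -128, 127)
    return (q - zero_point) / scale

# Non-negative values (for example, activations after a ReLU) are asymmetric:
# absmax leaves the negative half of the int8 range unused.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 4.0, size=1000)

err_absmax = float(np.mean(np.abs(absmax_roundtrip(x) - x)))
err_zeropoint = float(np.mean(np.abs(zeropoint_roundtrip(x) - x)))
print("mean abs error, absmax:   ", err_absmax)
print("mean abs error, zeropoint:", err_zeropoint)
```

On asymmetric data like this, zeropoint quantization uses the full int8 range and roughly halves the step size, which is consistent with the zeropoint columns sitting closer to the 32-bit baseline in the table.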

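LLM.int8() maintains perplexity across models from 125M to 13B parameters and, unlike the absmax, row-wise, and zeropoint baselines, does not degrade as model size grows.
On zero-shot tasks with OPT models up to 175B, LLM.int8() matches full (16-bit) precision, whereas the baseline 8-bit methods degrade.
Large-magnitude outlier features emerge at around 6.7B parameters, with about 150k outliers per sequence concentrated in about 6 feature dimensions.
Mixed-precision decomposition (16-bit for outliers, 8-bit for everything else) achieves zero-degradation inference up to 175B parameters.
Memory savings are substantial (about 2x for BLOOM-176B) with only a modest runtime impact, which makes 8-bit inference of very large models feasible on consumer GPUs.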
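
The roughly 2x memory figure follows from bytes per parameter. The back-of-the-envelope calculation below assumes exactly 176e9 parameters and counts weights only. It ignores activations, quantization constants, and any values kept in 16-bit, so real savings will be somewhat below 2x. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

n_params = 176e9  # assumed parameter count for a BLOOM-176B-sized model

fp16_gib = weight_memory_gib(n_params, 2)  # 16-bit weights: 2 bytes each
int8_gib = weight_memory_gib(n_params, 1)  # 8-bit weights: 1 byte each

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"8-bit weights:  {int8_gib:.1f} GiB")
print(f"ratio: {fp16_gib / int8_gib:.1f}x")
```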

This review was generated by AI and checked by a human editor.