QUICK REVIEW

[Paper Review] Efficient LLM Inference on CPUs

Haihao Shen, Hanwen Chang|arXiv (Cornell University)|Nov 1, 2023

Topic Modeling | Citations: 10

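One-Sentence Summary

This paper introduces an automatic INT4 weight-only quantization flow and a CPU-optimized LLM runtime that together accelerate inference on CPUs, lowering per-token latency with minimal accuracy loss. The approach is evaluated on LLMs with 3B to 20B parameters using a single socket of a 4th Generation Intel Xeon processor.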

ABSTRACT

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which requires a demand for large memory capacity and high memory bandwidth. In this paper, we propose an effective approach that can make the deployment of LLMs more efficiently. We support an automatic INT4 weight-only quantization flow and design a special LLM runtime with highly-optimized kernels to accelerate the LLM inference on CPUs. We demonstrate the general applicability of our approach on popular LLMs including Llama2, Llama, GPT-NeoX, and showcase the extreme inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.

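Research Motivation and Objectives

Motivate the efficient deployment of large language models (LLMs) on CPU hardware.
Propose an automatic INT4 weight-only quantization flow built on Intel Neural Compressor.
Develop a tensor library for CPUs with optimized kernels, together with an LLM runtime.
Demonstrate accuracy and performance for popular LLMs (3B–20B) on 4th Gen Intel Xeon processors.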

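Proposed Method

An automatic INT4 quantization flow built on Intel Neural Compressor that supports GPTQ, SignRound, AWQ, TEQ, and RTN, with tunable granularity (for example 32, 64, 128, ... 1024).
The flow produces high-quality INT4 models with a relative accuracy loss of <1% against the FP32 baseline.
A ggml-inspired tensor library for CPUs that provides INT4 kernels across a range of ISAs (AVX2, AVX512, AVX512_VNNI, AMX) and supports dynamic quantization of inputs.
An LLM runtime design with KV-cache optimizations and a CPU tensor backend that makes decoder-only Transformer inference efficient.
Evaluation against a ggml-based implementation on open-source LLMs (3B–20B) running on 4th Gen Intel Xeon Scalable processors.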
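To make the group-wise granularity concrete, here is a minimal sketch of round-to-nearest (RTN) weight-only quantization, the simplest of the listed algorithms. It is not the Intel Neural Compressor implementation: the function names `quantize_rtn_int4` and `dequantize_int4`, the symmetric [-8, 7] range, and the use of one floating-point scale per group are assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_rtn_int4(weights: List[float], group_size: int = 32) -> Tuple[List[int], List[float]]:
    """Symmetric round-to-nearest INT4 quantization with one scale per group.

    Hypothetical helper for illustration; not the paper's or the library's API.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; avoid a zero scale.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            q_values.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return q_values, scales


def dequantize_int4(q_values: List[int], scales: List[float], group_size: int) -> List[float]:
    """Reconstruct approximate weights from INT4 values and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


if __name__ == "__main__":
    w = [1.0, -0.5, 0.25, 0.0, 0.03, -0.02, 0.01, 0.0]
    q, s = quantize_rtn_int4(w, group_size=4)
    print(q, s)
    print(dequantize_int4(q, s, group_size=4))
```

A smaller group size gives each scale fewer weights to cover, which tends to reduce rounding error at the cost of storing more scales; that trade-off is what the tunable granularity exposes.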

Figure 1: The left part is the automatic INT4 quantization flow: given a FP32 model, the flow takes the default INT4 quantization recipes and evaluates the accuracy of INT4 model; the recipe tuning loop is optional, if INT4 model can meet the accuracy target. The right part is a simplified runtime f
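The quantization flow in the left part of Figure 1 can be sketched as a simple loop: quantize with the default recipe, evaluate, and move on to further recipes only if the accuracy target is missed. In this sketch `tune_int4_recipe`, the `quantize` and `evaluate` callables, and the recipe dictionaries are all hypothetical stand-ins rather than the Intel Neural Compressor API; the 1% relative-loss threshold follows the target stated in this review.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

Recipe = Dict[str, Any]


def tune_int4_recipe(
    fp32_model: Any,
    recipes: Iterable[Recipe],
    quantize: Callable[[Any, Recipe], Any],
    evaluate: Callable[[Any], float],
    max_relative_loss: float = 0.01,
) -> Optional[Tuple[Any, Recipe]]:
    """Return the first (INT4 model, recipe) pair that meets the accuracy target.

    Assumes `evaluate` returns a positive, higher-is-better accuracy score.
    The first entry of `recipes` plays the role of the default recipe.
    """
    baseline = evaluate(fp32_model)
    for recipe in recipes:
        candidate = quantize(fp32_model, recipe)
        accuracy = evaluate(candidate)
        relative_loss = (baseline - accuracy) / baseline
        if relative_loss <= max_relative_loss:
            return candidate, recipe  # target met, no further tuning needed
    return None  # no recipe satisfied the target


if __name__ == "__main__":
    # Toy stand-ins: models are strings, accuracies come from a lookup table.
    scores = {"fp32": 0.80, "fp32+rtn": 0.78, "fp32+gptq": 0.795}
    result = tune_int4_recipe(
        "fp32",
        [{"algo": "rtn"}, {"algo": "gptq"}],
        quantize=lambda model, recipe: model + "+" + recipe["algo"],
        evaluate=lambda model: scores[model],
    )
    print(result)
```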

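Experimental Results

Research Questions

RQ1: Can automatic INT4 quantization produce INT4 models with almost no accuracy loss (<1%) relative to the FP32 baseline across a variety of LLMs?
RQ2: How does the CPU-optimized LLM runtime compare with a ggml-based baseline on next-token generation latency?
RQ3: What practical latency and accuracy gains result from deploying 3B–20B parameter LLMs on a single socket of 4th Gen Intel Xeon CPUs?
RQ4: Which CPU kernels and KV-cache optimizations contribute most to speeding up LLM inference?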

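Key Findings

INT4 models stay within <1% of the FP32 baseline accuracy across multiple LLMs (e.g., GPT-J 6B, Llama-2 7B, Llama 7B, GPT-NeoX 20B, Falcon 7B).
The LLM runtime outperforms the ggml-based solution on next-token latency by up to 1.6x at group size 128 and by up to 1.3x at group size 32.
Generation latency on CPU hardware ranges from 20 ms to 80 ms per token for 6B–20B models on a single socket of 4th Gen Intel Xeon Scalable processors.
The end-to-end pipeline that combines automatic INT4 quantization with the CPU-optimized runtime delivers efficient LLM inference on CPUs while preserving accuracy.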
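As a rough aid for comparing group sizes 32 and 128, the sketch below computes the storage cost per weight when each group carries one scale. The 16-bit scale width and the function names are assumptions for illustration; the review does not state the storage format, and the sketch covers only weight storage, not the measured latency differences.

```python
def effective_bits_per_weight(group_size: int, weight_bits: int = 4, scale_bits: int = 16) -> float:
    """Average bits stored per weight: the quantized value plus its share of the group scale.

    scale_bits=16 is an assumed scale width, not a figure from the paper.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return weight_bits + scale_bits / group_size


def weight_footprint_gib(num_params: float, group_size: int) -> float:
    """Approximate weight storage in GiB under the assumptions above."""
    return num_params * effective_bits_per_weight(group_size) / 8 / 2**30


if __name__ == "__main__":
    for g in (32, 128):
        print(g, effective_bits_per_weight(g), round(weight_footprint_gib(7e9, g), 2))
```

Under these assumptions a group size of 32 costs 4.5 bits per weight and a group size of 128 costs 4.125 bits per weight, compared with 32 bits per weight for FP32.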

Figure 2: Key components in LLM runtime: general and LLM specialized.

This review was generated by AI and checked by a human editor.