QUICK REVIEW

[Paper Review] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Muhammad Adnan, Akhil Arunkumar|arXiv (Cornell University)|Mar 14, 2024

Algorithms and Data Compression | Citations: 6

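One-line summary

Keyformer reduces the KV cache at inference time by keeping only key tokens and recent tokens, selected with a Gumbel-based score function and logit regularization. It reports up to 2.1x lower inference latency and 2.4x higher token-generation throughput, and maintains baseline accuracy with 70% of the KV cache.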

ABSTRACT

Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces "Keyformer", an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as "key" tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer's reduction of KV cache reduces inference latency by 2.1x and improves token generation throughput by 2.4x, while preserving the model's accuracy.
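
To make the memory pressure described in the abstract concrete, the following minimal sketch estimates how KV cache size grows with sequence length. It assumes standard multi-head attention where every layer stores one key and one value vector per head per token. The layer count, head count, head dimension, and 2-byte (fp16) storage are illustrative assumptions, not values taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for standard multi-head attention.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return (2 * num_layers * num_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Illustrative 7B-class configuration (assumed, not from the paper).
    full = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=8192)
    print(f"full cache: {full / 2**30:.1f} GiB")        # 4.0 GiB for this configuration
    print(f"50% budget: {0.5 * full / 2**30:.1f} GiB")  # 2.0 GiB
```

Because the estimate is linear in `seq_len`, retaining a fixed fraction of tokens shrinks both the cache footprint and the data moved per decoding step by the same fraction.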

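Motivation and objectives

Highlight the KV cache latency and memory-bandwidth bottlenecks in long-context autoregressive language model inference as the motivating problem.
Propose an inference-time mechanism that reduces the KV cache while preserving model accuracy.
Identify and retain a small set of key tokens together with recent tokens to form a reduced KV cache.
Develop a learning-inspired score function with Gumbel-based regularization to identify key tokens across decoding steps.
Demonstrate robustness across models with different positional embeddings and tasks with long contexts.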

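Proposed method

Observes that approximately 90% of the attention weight concentrates on a small set of tokens, called key tokens.
Proposes Keyformer, which retains k tokens in a reduced KV cache by combining a recent window of w tokens with k-w key tokens.
Introduces Gumbel-based regularization of the unnormalized logits to identify key tokens and mitigate the distribution shift that follows token removal.
Uses a temperature schedule τ that increases as decoding proceeds to adjust the probability distribution as tokens are removed.
Accumulates the score function f_theta across the prompt and generation phases to identify key tokens consistently.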
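
The steps above can be sketched as a toy, single-head simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear temperature schedule from 1.0 to 2.0, the function names, and `logit_fn` (a hypothetical stand-in for the real unnormalized attention logits) are all assumptions made for the example.

```python
import math
import random


def gumbel_softmax_scores(logits, tau, rng):
    """Softmax over Gumbel-perturbed logits at temperature tau."""
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def temperature(step, total_steps, tau_init=1.0, tau_end=2.0):
    """Temperature that rises linearly with the decoding step (assumed schedule)."""
    if total_steps <= 1:
        return tau_init
    return tau_init + (tau_end - tau_init) * step / (total_steps - 1)


def select_kept_positions(acc_scores, k, w):
    """Indices to keep: the w most recent plus the k-w best-scoring older ones."""
    n = len(acc_scores)
    if n <= k:
        return list(range(n))
    recent = list(range(n - w, n))
    older = range(n - w)
    key = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:k - w]
    return sorted(key) + recent


def run_toy_decode(logit_fn, prompt_len, steps, k, w, seed=0):
    """Simulate cache eviction; returns the original positions still cached."""
    rng = random.Random(seed)
    kept = list(range(prompt_len))      # original token positions in the cache
    acc = {p: 0.0 for p in kept}        # accumulated score per cached position
    for t in range(steps):
        if t > 0:                       # token generated at the previous step
            new_pos = prompt_len + t - 1
            kept.append(new_pos)
            acc[new_pos] = 0.0
        tau = temperature(t, steps)
        scores = gumbel_softmax_scores([logit_fn(t, p) for p in kept], tau, rng)
        for p, s in zip(kept, scores):
            acc[p] += s                 # accumulate across decoding steps
        idx = select_kept_positions([acc[p] for p in kept], k, w)
        kept = [kept[i] for i in idx]
        acc = {p: acc[p] for p in kept}
    return kept


if __name__ == "__main__":
    # Hypothetical logits: position 1 draws far more attention than the rest.
    print(run_toy_decode(lambda t, p: 20.0 if p == 1 else 0.0,
                         prompt_len=8, steps=4, k=4, w=2))
```

In this toy run the cache never exceeds k entries after eviction: the last w entries are always the most recent positions, and the strongly attended position survives because its accumulated score stays highest among the older tokens.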

Figure 1: (a) Inference latency normalized to sequence length of 512. We measure the KV cache data movement for the MPT-7B (Team et al., 2023) model with varying sequence length (50% context + 50% text generation). (b) The KV cache size and model size as sequ

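Experimental results

Research questions

RQ1: Can Keyformer reduce the KV cache size at inference time without degrading generation quality below the MLPerf accuracy targets (99–99.9% of baseline)?
RQ2: Do Keyformer's key-token selection and mixed attention maintain or improve ROUGE-based summarization quality and long-context performance across models with different positional embeddings?
RQ3: Under various KV cache budgets, how does Keyformer compare with Window Attention and H2O in latency, throughput, and accuracy?
RQ4: Is Gumbel-based logit regularization necessary to maintain accuracy when tokens are removed from the KV cache?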

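Key findings

Keyformer reduces the KV cache by about 50% while staying close to MLPerf-level accuracy (ROUGE targets within 99–99.9% of full attention).
With a 50% KV cache reduction, it improves inference latency by 2.1x and token-generation throughput by 2.4x.
It maintains baseline accuracy with only 70% of the KV cache, outperforming H2O, which struggles even with larger budgets.
Across GPT-J, Cerebras-GPT, and MPT, which use different positional embeddings, Keyformer is more accurate than Window Attention and H2O under reduced KV budgets.
On long-context summarization (MPT-7B-storywriter on GovReport), it maintains 99% accuracy with a 50% KV cache, unlike H2O.
On tasks including summarization and conversation, it preserves quality while improving speed, showing robustness to context length and model differences.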
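
The accuracy criterion used in these findings (a reduced-cache score within 99% or 99.9% of the full-attention baseline) can be expressed as a simple ratio check. The sketch below is illustrative only: `meets_accuracy_target` is a hypothetical helper, and the ROUGE values in the usage lines are made-up numbers, not results from the paper.

```python
def meets_accuracy_target(reduced_score, baseline_score, target=0.99):
    """Return True if the reduced-cache score is at least `target` of baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return reduced_score / baseline_score >= target


if __name__ == "__main__":
    # Made-up ROUGE scores, for illustration only.
    baseline, reduced = 20.0, 19.9
    print(meets_accuracy_target(reduced, baseline, target=0.99))
    print(meets_accuracy_target(reduced, baseline, target=0.999))
```

With these made-up numbers the ratio is 0.995, so the run would satisfy the 99% target but not the stricter 99.9% one.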

Figure 2: Attention block for generative inference. (a) Full attention (Brown et al., 2020) with current token attending all previous tokens. (b) Window attention ($w=4$): Focusing on the most recent 4 tokens. (c) Dilated window attention ($w=4$, dilation = 1). (d) Keyformer ($w=2$


This review was generated by AI and checked by a human editor.