QUICK REVIEW

[Paper Review] KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization

Zirui Liu, Jiayi Yuan|arXiv (Cornell University)|Jan 1, 2023

Quantum-Dot Cellular Automata | Cited by 6

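One-Sentence Summary

KIVI is a plug-and-play, tuning-free 2-bit asymmetric quantization method for the key-value (KV) cache in large language model inference: keys are quantized per channel and values per token. It cuts peak memory usage by 2.6× and supports batch sizes up to 4× larger, raising throughput by 2.35× to 3.47× on Llama, Falcon, and Mistral models with only minimal accuracy loss.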

ABSTRACT

Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.

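Motivation and Objectives

Address the growing memory and speed bottleneck that the KV cache imposes on large language model (LLM) inference at large batch sizes and long context lengths.
Analyze the element distribution of the KV cache in popular LLMs to understand the challenges and limits of quantizing it.
Develop a hardware-friendly, tuning-free 2-bit quantization method that sharply reduces memory usage while preserving model accuracy.
Enable larger batch sizes and higher throughput on real LLM inference workloads through efficient KV cache compression.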
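To make the memory pressure concrete, the following back-of-the-envelope sketch estimates the raw KV cache size as a function of batch size and context length. The helper name `kv_cache_bytes` is hypothetical, and the layer and head counts are the commonly published Llama-2-7B configuration, used here only as an illustrative assumption. The 2-bit figure ignores the scale and zero-point metadata and any full-precision residual tokens, so it is an idealized lower bound and not the paper's reported 2.6× peak-memory number (which also includes model weights).

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bits=16):
    """Raw size of the KV cache: one key and one value vector per token, head, and layer."""
    elements = 2 * n_layers * batch * seq_len * n_kv_heads * head_dim  # 2 = keys + values
    return elements * bits // 8

# Assumed configuration (Llama-2-7B style): 32 layers, 32 KV heads, head dimension 128.
LLAMA2_7B = dict(n_layers=32, n_kv_heads=32, head_dim=128)

fp16_bytes = kv_cache_bytes(batch=1, seq_len=4096, **LLAMA2_7B)
two_bit_bytes = kv_cache_bytes(batch=1, seq_len=4096, bits=2, **LLAMA2_7B)

print(f"fp16 KV cache:  {fp16_bytes / 2**30:.2f} GiB per sequence")
print(f"2-bit KV cache: {two_bit_bytes / 2**30:.2f} GiB per sequence (metadata ignored)")
```

Because the size scales linearly with both `batch` and `seq_len`, the cache quickly overtakes the model weights as batches and contexts grow, which is the bottleneck the objectives above target.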

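Proposed Method

Proposes KIVI, a plug-and-play 2-bit asymmetric quantization method for the KV cache that requires no fine-tuning or model retraining.
Applies per-channel quantization to the key cache, grouping elements along the channel dimension so that each channel's quantization error stays confined to that channel.
Applies per-token quantization to the value cache, which matches the streaming nature of autoregressive generation and isolates the error within each token.
Splits the KV cache into a grouped part and a residual part: the grouped part is quantized group-wise, while the residual part is kept in full precision.
Uses tiled matrix multiplication during attention computation to combine the grouped and residual parts of the cache, preserving accuracy.
Adopts a hardware-friendly implementation that minimizes quantization overhead in both the prefill and decoding phases.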
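The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quantize_asym`, `dequantize`, `attention_logits`), the group size of 32, and the choice to treat the trailing `n_tokens % group` tokens as the full-precision residual are all hypothetical simplifications. It stores 2-bit codes in `uint8` without bit-packing and dequantizes before each tile's matrix product, whereas a hardware-friendly version would presumably fuse these steps.

```python
import numpy as np

def quantize_asym(x, bits=2, axis=0):
    """Asymmetric uniform quantization: one (scale, zero-point) pair per slice along `axis`."""
    levels = 2 ** bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant slices
    codes = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, x_min

def dequantize(codes, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + zero

def attention_logits(query, keys, group=32, bits=2):
    """Tiled q @ K^T: quantized tiles for the grouped part, full precision for the residual."""
    n_tokens = keys.shape[0]
    n_grouped = (n_tokens // group) * group
    tiles = []
    for start in range(0, n_grouped, group):
        # Per-channel for keys: statistics are taken over the tokens of the group (axis=0).
        codes, scale, zero = quantize_asym(keys[start:start + group], bits, axis=0)
        tiles.append(query @ dequantize(codes, scale, zero).T)
    tiles.append(query @ keys[n_grouped:].T)  # residual tokens stay in full precision
    return np.concatenate(tiles, axis=-1)

rng = np.random.default_rng(0)
keys = rng.standard_normal((70, 16)).astype(np.float32)    # (tokens, channels)
values = rng.standard_normal((70, 16)).astype(np.float32)
query = rng.standard_normal(16).astype(np.float32)

logits = attention_logits(query, keys)  # 64 tokens quantized in 2 groups, 6 residual tokens
# Per-token for values: statistics are taken over the channels of each token (axis=1).
v_codes, v_scale, v_zero = quantize_asym(values, bits=2, axis=1)
values_hat = dequantize(v_codes, v_scale, v_zero)
```

Per-token quantization suits streaming generation because each newly generated value vector can be quantized on its own, while per-channel key quantization needs a full group of tokens before its statistics are available, which is one reason to keep a recent window unquantized.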

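Experimental Results

Research Questions

RQ1: How do the element distributions of the key cache and the value cache differ in LLMs, and how do these differences affect the quantization strategy?
RQ2: Why is per-channel quantization more effective for the key cache, while per-token quantization is better suited to the value cache?
RQ3: Given the streaming and dynamic nature of the KV cache, can a 2-bit quantization scheme achieve high model accuracy without fine-tuning?
RQ4: How do different quantization schemes affect memory usage, batch size, and throughput on real LLM inference workloads?
RQ5: How does KIVI's split into grouped and residual parts enable efficient quantization of a large, streaming KV cache while preserving accuracy?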

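Key Findings

Per-channel quantization of the key cache limits error propagation across channels and is essential because a few outlier channels have very large magnitudes.
Per-token quantization of the value cache is necessary because the value cache acts as a mixer in the attention computation, so the error must be isolated within each token.
KIVI reduces peak memory usage by 2.6× on Llama-2-7B while maintaining nearly the same model quality across multiple benchmarks.
The method supports batch sizes up to 4× larger and delivers 2.35× to 3.47× higher throughput on real LLM inference workloads.
Ablation studies show that group size and residual length have a measurable but manageable effect on performance; the best configuration is determined empirically.
KIVI maintains strong performance across a range of models, including Llama, Falcon, and Mistral, with only a minimal accuracy drop even at 2-bit precision.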
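The first finding can be illustrated with a small synthetic experiment. The sketch below is an assumption-laden toy, not a reproduction of the paper's measurements: the key matrix is random Gaussian data, the outlier is modeled as a constant offset of 50 added to one hypothetical channel, and `fake_quant` is an illustrative helper that quantizes and immediately dequantizes at 2 bits.

```python
import numpy as np

def fake_quant(x, axis, bits=2):
    """Quantize then dequantize with asymmetric min/max statistics taken along `axis`."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    return np.round((x - lo) / scale) * scale + lo

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 64))  # (tokens, channels), synthetic data
keys[:, 5] += 50.0                     # assumed outlier channel with a large, consistent magnitude

# axis=0: one range per channel (per-channel); axis=1: one range per token (per-token).
per_channel_mse = float(np.mean((keys - fake_quant(keys, axis=0)) ** 2))
per_token_mse = float(np.mean((keys - fake_quant(keys, axis=1)) ** 2))

print(f"per-channel MSE: {per_channel_mse:.3f}")
print(f"per-token MSE:   {per_token_mse:.3f}")
```

With per-token statistics, the outlier stretches every token's range so that the four available levels are spread too coarsely to resolve the normal channels. With per-channel statistics, the outlier channel gets its own range and the remaining channels are unaffected.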

This review was generated by AI and reviewed by human editors.