QUICK REVIEW

[Paper Review] QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill

Dalton Jones, Junyoung Park | arXiv (Cornell University) | Feb 9, 2026

Big Data and Digital Economy | Cited by 0

One-Sentence Summary

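QuoKA is a training-free, hardware-agnostic sparse attention method for chunked LLM prefill. It uses cosine similarity to select representative queries and KVs, achieving a significant latency reduction while staying close to baseline accuracy.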

ABSTRACT

We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and have the greatest contribution to final attention logits. By prioritizing these low cosine similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that, while realizing a 3x reduction in time-to-first-token, 5x speedup in attention on Nvidia GPUs and up to nearly a 7x speedup on Intel Xeon CPUs, QUOKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.

Research Motivation and Objectives

Motivate reducing prefill latency in transformer inference under chunked prefill.
Propose a lightweight, hardware-agnostic sparse-attention approach that operates on the KV cache.
Show that selecting low-cosine-similarity queries and their most relevant KVs preserves accuracy while reducing computations.
Demonstrate robustness and generalization across models and hardware (GPUs/CPUs).
Provide empirical evidence on long-context and generation-oriented benchmarks.

Proposed Method

Retain a small set of representative queries based on cosine dissimilarity to the mean query.
Compute a cosine-similarity proxy to score query–key relevance instead of using dot products.
Aggregate scores across queries and KV groups to select a reduced KV subset.
Feed the reduced KV set into a standard dense attention kernel (e.g., FlashAttention).
Operate within chunked prefill to achieve sub-quadratic complexity in attention.
Maintain portability by using standard linear-algebra operations without custom kernels.
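
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, assumes the keys and values come from previously cached chunks (so no causal mask is applied), and aggregates scores across queries with a max, which is an illustrative choice the summary does not specify. The function names and the parameters `n_queries` and `kv_budget` are hypothetical.

```python
import numpy as np

def _unit(x, eps=1e-8):
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_queries(q, n_keep):
    """Return indices of the n_keep queries least cosine-similar to the mean query."""
    mean_q = q.mean(axis=0, keepdims=True)           # shape (1, d)
    sim = (_unit(q) @ _unit(mean_q).T).ravel()       # shape (n_q,)
    return np.argsort(sim)[:n_keep]                  # lowest similarity first

def select_kv(q_rep, k, budget):
    """Score keys by cosine similarity to the retained queries and keep the top `budget`."""
    scores = _unit(q_rep) @ _unit(k).T               # shape (n_rep, n_k)
    agg = scores.max(axis=0)                         # assumption: max over queries
    keep = np.argsort(agg)[-budget:]                 # highest aggregated scores
    return np.sort(keep)                             # restore positional order

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_chunk_attention(q, k, v, n_queries, kv_budget):
    """Dense attention of every chunk query over the reduced KV set."""
    q_idx = select_queries(q, n_queries)
    kv_idx = select_kv(q[q_idx], k, kv_budget)
    logits = q @ k[kv_idx].T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v[kv_idx], kv_idx

# Usage with random data: 16 chunk queries, 256 cached keys, keep 32 KV pairs.
rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))
k = rng.standard_normal((256, 64))
v = rng.standard_normal((256, 64))
out, kv_idx = sparse_chunk_attention(q, k, v, n_queries=4, kv_budget=32)
print(out.shape, kv_idx.size)
```

In a real system the final step would hand the gathered keys and values to an optimized dense kernel rather than the plain softmax used here, and scores would also need to be aggregated across KV groups, which this single-head sketch leaves out.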

Experimental Results

Research Questions

RQ1: How much can KV attention be reduced during chunked prefill without substantial accuracy loss?
RQ2: Do cosine-based scoring and geometry-aware query selection outperform generation-oriented or fixed-pattern sparsity in prefill?
RQ3: What is QuoKA’s accuracy and latency trade-off across long-context benchmarks and different model families?
RQ4: How well does QuoKA generalize across GPUs/CPUs and across decoder-only LLM architectures?
RQ5: Can QuoKA maintain performance when B_CP and B_SA budgets vary?

Key Findings

QuoKA achieves up to 5x attention speedup on Nvidia GPUs during prefill.
QuoKA delivers an approximately 3x reduction in time-to-first-token (TTFT) on long prompts.
QuoKA reaches a speedup of nearly 7x on Intel Xeon CPUs and up to 5-6x on consumer GPUs.
QuoKA uses 88% fewer key-value pairs per attention evaluation while preserving near-baseline accuracy.
Across the Needle-In-A-Haystack, RULER, LongBench, and Math500 benchmarks, QuoKA outperforms competing sparse attention methods.
Accuracy degrades gradually with sparsity, enabling tunable efficiency-accuracy trade-offs.
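
As a rough illustration of why limiting the KV set matters under chunked prefill, the sketch below counts the query-key pairs the attention kernel would score with and without a cap on the keys per chunk. The prompt length, chunk size, and budget are hypothetical values chosen for the example, not configurations from the paper, and the count ignores the cost of the selection step itself, so it should not be read as a prediction of the measured speedups.

```python
def attention_pairs(n_tokens, chunk, kv_budget=None):
    """Count query-key pairs scored by the attention kernel over a chunked prefill.

    Simplification: every query in a chunk attends to all keys cached so far
    (including its own chunk), capped at kv_budget when one is given.
    """
    total = 0
    for end in range(chunk, n_tokens + 1, chunk):
        keys = end if kv_budget is None else min(end, kv_budget)
        total += chunk * keys
    return total

n_tokens, chunk = 32768, 1024   # hypothetical prompt length and chunk size
kv_budget = 4096                # hypothetical cap on keys attended per chunk

full = attention_pairs(n_tokens, chunk)
sparse = attention_pairs(n_tokens, chunk, kv_budget)
print(full, sparse, round(full / sparse, 2))
```

Without a cap, the count grows quadratically with prompt length; with a fixed cap, each additional chunk adds the same amount of work, so the count grows linearly once the cache exceeds the budget. Raising or lowering the cap moves along the efficiency-accuracy trade-off described above.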

This review was generated by AI and reviewed by human editors.