QUICK REVIEW

[Paper Review] RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Yingsheng Geng, Yuchong Gao|arXiv (Cornell University)|Feb 28, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

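RelayCaching is a training-free inference method that reuses the upstream agent's decoding-phase KV cache during the downstream prefill phase and selectively rectifies tokens in the middle layers, achieving over 80% KV cache reuse and up to a 4.7× TTFT speedup while preserving accuracy.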

ABSTRACT

The increasing complexity of AI tasks has shifted the paradigm from monolithic models toward multi-agent large language model (LLM) systems. However, these collaborative architectures introduce a critical bottleneck: redundant prefill computation for shared content generated by previous agents, which significantly increases KV cache memory usage and time-to-first-token (TTFT). While various KV cache methods have been proposed to mitigate prefill redundancy, they either fail to maintain accuracy on agent-generated outputs or exhibit low reuse rates due to rigid constraints. We present RelayCaching, a training-free inference method that directly reuses decoding phase KV caches from previous agents in subsequent prefill phases. Our key insight is that KV caches for identical content are highly consistent across phases, while prefix-induced deviations are sparse and localized within a limited range of layers and token positions. By selectively recomputing KV caches at these positions, RelayCaching preserves model accuracy with minimal overhead, yielding a superior accuracy-efficiency trade-off over existing methods. Experiments on diverse collaborative LLM tasks spanning mathematical reasoning, general knowledge, and code generation demonstrate that RelayCaching achieves over 80% KV cache reuse, reduces TTFT by up to $4.7\times$ compared to the standard pipeline, all with negligible accuracy degradation.

Motivation and Objectives

  • Motivate reduction of redundant prefill computation in multi-agent LLM pipelines caused by cascading shared content.
  • Characterize how decoding KV caches align with full-prefill caches under decode-to-prefill reuse.
  • Develop a training-free method to selectively rectify KV caches to maintain accuracy.
  • Demonstrate the efficiency gains and accuracy retention of RelayCaching across reasoning, coding, and knowledge benchmarks.

Proposed Method

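  • Analyze macro- and micro-level KV cache alignment between decoding caches and full-prefill caches across layers and tokens.
  • Identify a U-shaped layer-wise similarity pattern and sparse token deviations that are correlated across layers.
  • Design an offline layer-range profiler that locates the critical middle-layer range and a detection layer used for token rectification.
  • Introduce a token selector that combines a deviation criterion with an influence criterion to determine the set of tokens that need sparse rectification.
  • Implement RelayCaching as a two-part system, the layer-range profiler and the token selector, so that the prefill phase performs selective recomputation instead of full recomputation.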
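
The token selector described above can be illustrated with a minimal sketch. This summary does not state how the deviation and influence criteria are combined, so everything here is an assumption made for illustration: the function name `select_tokens`, the min-max normalization, the weighted sum controlled by `alpha`, and the recomputation `budget` are all hypothetical and are not taken from the paper.

```python
def select_tokens(deviation, influence, alpha=0.5, budget=0.2):
    """Pick token positions to recompute (hypothetical sketch, not the paper's rule).

    deviation[i]: assumed per-token deviation score measured at the detection layer
    influence[i]: assumed per-token influence score (how much the token matters downstream)
    alpha:        assumed mixing weight between the two criteria
    budget:       assumed fraction of tokens that may be recomputed
    """
    if len(deviation) != len(influence):
        raise ValueError("deviation and influence must have the same length")
    n = len(deviation)
    if n == 0:
        return []

    def normalize(xs):
        # Min-max scale to [0, 1] so the two criteria are comparable.
        lo, hi = min(xs), max(xs)
        if hi == lo:
            return [0.0] * len(xs)
        return [(x - lo) / (hi - lo) for x in xs]

    d = normalize(deviation)
    f = normalize(influence)
    # Assumed combination: a weighted sum of the two normalized scores.
    scores = [alpha * a + (1.0 - alpha) * b for a, b in zip(d, f)]

    # Recompute at least one token, and at most a `budget` fraction of them.
    k = max(1, int(budget * n))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    dev = [0.1, 0.9, 0.2, 0.8, 0.1]
    inf = [0.0, 0.5, 0.1, 1.0, 0.0]
    print(select_tokens(dev, inf, alpha=0.5, budget=0.4))
```

All positions not returned by the selector would keep their reused decoding-phase KV entries, which is what keeps the recomputation sparse.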

Experimental Results

Research Questions

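  • RQ1: Can RelayCaching maintain generation quality comparable to full prefill while reusing decoding KV caches?
  • RQ2: How much efficiency gain (KV cache reuse rate and TTFT speedup) can RelayCaching achieve in multi-agent scenarios?
  • RQ3: How do the layer-range profiler and the token selector drive the trade-off between accuracy and efficiency?
  • RQ4: How sensitive is RelayCaching to its key hyperparameters and to the task context?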

Key Findings

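  • Decoding KV caches remain highly aligned with full-prefill caches under prefix changes, and value-vector cosine similarity is the primary deviation signal.
  • Middle layers show the largest deviation in the U-shaped similarity curve and dominate downstream generation quality.
  • Token-level deviations are sparse and strongly correlated across layers, which allows highly selective rectification.
  • RelayCaching achieves over 80% KV cache reuse and up to a 4.7× TTFT reduction across multiple tasks while keeping accuracy close to that of full prefill.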
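
To make the layer-wise finding concrete, the sketch below computes a mean value-vector cosine similarity per layer and then locates the contiguous run of layers that falls below a threshold, which is where a U-shaped curve dips. It is only an illustration under stated assumptions: the nested-list cache layout `values[layer][token]`, the function names, the threshold, and the example numbers are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def layer_similarity(decode_values, prefill_values):
    """Mean per-token cosine similarity for each layer.

    Assumed layout: values[layer][token] is that token's value vector.
    """
    sims = []
    for dec_layer, pre_layer in zip(decode_values, prefill_values):
        per_token = [cosine(d, p) for d, p in zip(dec_layer, pre_layer)]
        sims.append(sum(per_token) / len(per_token))
    return sims


def low_similarity_range(sims, threshold):
    """Return (first, last) of the longest run of layers below threshold, or None."""
    best = None
    start = None
    # A sentinel at the end closes a run that reaches the last layer.
    for i, s in enumerate(list(sims) + [float("inf")]):
        if s < threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


if __name__ == "__main__":
    # Made-up U-shaped profile: high similarity at both ends, lower in the middle.
    profile = [0.99, 0.97, 0.80, 0.75, 0.82, 0.96, 0.99]
    print(low_similarity_range(profile, threshold=0.9))
```

A profiler along these lines would run offline, so the selected layer range can be fixed before serving and adds no cost at inference time.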

This review was generated by AI and reviewed by human editors.