QUICK REVIEW

[Paper Review] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Chao Jin, Zili Zhang|arXiv (Cornell University)|Apr 18, 2024

Topic Modeling|Cited by 5

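ONE-SENTENCE SUMMARY

RAGCache introduces a multilevel dynamic caching system that stores and shares the intermediate key-value tensors of retrieved documents, cutting redundant computation in RAG and thereby significantly reducing latency and raising throughput.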

ABSTRACT

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.

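MOTIVATION AND OBJECTIVES

Identify the performance bottlenecks in retrieval-augmented generation (RAG) systems.
Characterize the opportunity to reduce computation by caching the intermediate states of retrieved knowledge.
Design a multilevel dynamic caching system that shares key-value tensors across requests.
Develop a replacement policy and scheduling techniques suited to the prefix-sensitive attention behavior of RAG.
Demonstrate a prototype and quantify its latency and throughput gains over state-of-the-art baselines.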

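PROPOSED METHOD

Characterize RAG systems and identify the long sequences created by knowledge injection as the bottleneck.
Propose RAGCache, a knowledge-tree-based cache that stores and shares intermediate key-value tensors across requests.
Introduce a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy to manage the GPU and host memory hierarchy.
Implement cache-aware request reordering to raise the hit rate and reduce thrashing.
Develop dynamic speculative pipelining to overlap retrieval with inference and lower end-to-end latency.
Evaluate RAGCache on vLLM with Faiss, and compare it against SGLang and vLLM baselines.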
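
The knowledge tree and its replacement policy can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation. The names `Node`, `KnowledgeTree`, `match_prefix`, and `insert` are hypothetical. The priority uses the classic Greedy-Dual-Size-Frequency form (clock + frequency × cost / size). The per-node `cost` is simply passed in, since this summary does not say how RAGCache estimates it. Evicting only leaf nodes is an assumption made here so that cached prefixes stay contiguous, and the two memory tiers (GPU and host) are collapsed into a single token budget.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Node:
    """One document at one prefix position; stands in for its cached KV tensors."""
    doc_id: str
    size: int                       # tokens held by this node (assumed unit)
    cost: float                     # assumed cost of recomputing this node's KV tensors
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    freq: int = 0
    priority: float = 0.0


class KnowledgeTree:
    def __init__(self, capacity_tokens: int) -> None:
        self.root = Node(doc_id="<root>", size=0, cost=0.0)
        self.capacity = capacity_tokens
        self.used = 0
        self.clock = 0.0            # GDSF aging term, raised on each eviction

    def match_prefix(self, doc_ids: List[str]) -> List[Node]:
        """Return cached nodes covering the longest matching prefix of doc_ids."""
        node, path = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            path.append(child)
            node = child
        return path

    def insert(self, docs: List[Tuple[str, int, float]]) -> int:
        """Insert an ordered list of (doc_id, size, cost); return reused tokens."""
        node, reused = self.root, 0
        pinned: Set[int] = set()    # nodes on the current path must not be evicted
        for doc_id, size, cost in docs:
            child = node.children.get(doc_id)
            if child is not None:
                reused += child.size
            else:
                self._make_room(size, pinned)
                child = Node(doc_id, size, cost, parent=node)
                node.children[doc_id] = child
                self.used += size
            # GDSF priority: clock + frequency * cost / size
            child.freq += 1
            child.priority = self.clock + child.freq * child.cost / child.size
            pinned.add(id(child))
            node = child
        return reused

    def _leaves(self) -> List[Node]:
        stack, leaves = list(self.root.children.values()), []
        while stack:
            n = stack.pop()
            if n.children:
                stack.extend(n.children.values())
            else:
                leaves.append(n)
        return leaves

    def _make_room(self, size: int, pinned: Set[int]) -> None:
        """Evict lowest-priority unpinned leaves until `size` tokens fit."""
        while self.used + size > self.capacity:
            victims = [n for n in self._leaves() if id(n) not in pinned]
            if not victims:
                raise MemoryError("no evictable leaf; request exceeds cache capacity")
            victim = min(victims, key=lambda n: n.priority)
            self.clock = max(self.clock, victim.priority)
            del victim.parent.children[victim.doc_id]
            self.used -= victim.size


tree = KnowledgeTree(capacity_tokens=300)
first = tree.insert([("A", 100, 1.0), ("B", 100, 1.0)])    # nothing cached yet
second = tree.insert([("A", 100, 1.0), ("C", 100, 1.0)])   # shares the prefix A
third = tree.insert([("A", 100, 1.0), ("B", 100, 1.0)])    # full hit on A, B
tree.insert([("D", 100, 1.0)])                             # cache full: evicts leaf C
```

Because each node's key-value tensors depend on everything before it, the lookup is order-sensitive: `["A", "B"]` hits both nodes, while `["B", "A"]` hits nothing. In the usage lines, the least frequently used leaf `C` is the one evicted when `D` arrives.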

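EXPERIMENTAL RESULTS

Research Questions

RQ1: What is the main performance bottleneck of RAG systems, and can caching the intermediate states of retrieved knowledge relieve it?
RQ2: Do sharing key-value tensors across requests and multilevel caching improve TTFT and throughput?
RQ3: How should the replacement policy and scheduling be designed to respect the prefix-sensitive nature of LLM token generation in RAG?
RQ4: Does speculative pipelining effectively overlap retrieval with inference without compromising system stability?
RQ5: What empirical gains result from integrating RAGCache with existing LLM inference systems and vector databases?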

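Key Findings

The RAG generation bottleneck is dominated by the prefill phase, whose cost grows with sequence length and document size.
Caching the key-value tensors of retrieved documents significantly reduces prefill latency compared with full recomputation.
RAGCache reduces TTFT by up to 4× and improves throughput by up to 2.1× compared to vLLM integrated with Faiss.
Compared to SGLang, RAGCache reduces TTFT by up to 3.5× and improves throughput by up to 1.8×.
The knowledge tree enables prefix-aware caching that is aligned with document order and the LLM's state dependencies.
Dynamic speculative pipelining further lowers end-to-end latency by overlapping retrieval with inference.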
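
The overlap of retrieval and inference can be pictured with a small simulation. It is only a sketch under assumptions that this summary does not confirm: retrieval is assumed to expose provisional top-k results in stages, a speculative prefill is assumed to start on each new provisional result, and a stale prefill is assumed to be cancelled when the candidates change. The function `speculative_pipeline` and its return values are hypothetical, and no real vLLM or Faiss calls are made.

```python
from typing import Iterable, List, Optional, Tuple


def speculative_pipeline(
    stages: Iterable[List[str]],
) -> Tuple[List[str], int, int, int]:
    """Simulate speculative prefill over staged retrieval results.

    `stages` yields provisional top-k document ID lists; the last one is final.
    Returns (final_docs, prefills_started, prefills_discarded, final_start_stage),
    where final_start_stage is the 0-based stage at which the prefill that ends
    up being used was launched.
    """
    current: Optional[List[str]] = None
    started = discarded = 0
    final_start_stage = -1
    for stage_index, candidate in enumerate(stages):
        if candidate == current:
            continue                # speculation still valid; keep the prefill running
        if current is not None:
            discarded += 1          # candidates changed; cancel the stale prefill
        current = list(candidate)
        started += 1
        final_start_stage = stage_index
    if current is None:
        raise ValueError("retrieval produced no stages")
    return current, started, discarded, final_start_stage


result = speculative_pipeline(
    [["d1", "d7"], ["d1", "d3"], ["d1", "d3"], ["d1", "d3"]]
)
```

In this run the provisional result stabilizes at the second stage, so the prefill that is finally used starts two stages before retrieval finishes, at the price of one discarded speculative prefill.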


This review was generated by AI and reviewed by a human editor.