QUICK REVIEW

[Paper Review] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

Chao Jin, Zili Zhang|arXiv (Cornell University)|Apr 18, 2024
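Topic Modeling | Citations: 5
One-Sentence Summary

RAGCache introduces a multilevel dynamic caching system that stores and shares the intermediate key-value tensors of retrieved documents, reducing recomputation in RAG and substantially improving latency and throughput.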

ABSTRACT

Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.

Research Motivation and Objectives

  • Identify performance bottlenecks in retrieval-augmented generation (RAG) systems.
  • Characterize opportunities for caching intermediate states of retrieved knowledge to reduce computation.
  • Design a multilevel dynamic caching system that supports cross-request sharing of key-value tensors.
  • Develop a replacement policy and scheduling techniques tailored to RAG's prefix-sensitive attention behavior.
  • Demonstrate a prototype and quantify latency and throughput improvements over state-of-the-art baselines.

Proposed Method

  • Characterize RAG systems and pinpoint the long sequence generation caused by knowledge injection as the bottleneck.
  • Propose RAGCache, a knowledge-tree based cache that stores and shares intermediate key-value tensors across requests.
  • Introduce the prefix-aware Greedy Dual-Size Frequency (PGDSF) replacement policy to manage GPU/host memory tiers.
  • Implement cache-aware request reordering to improve hit rate and reduce thrashing.
  • Develop dynamic speculative pipelining to overlap retrieval and inference and reduce end-to-end latency.
  • Evaluate RAGCache on vLLM with Faiss and compare to SGLang and vLLM baselines.
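The knowledge tree and the PGDSF policy listed above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the class and method names (`KnowledgeTree`, `access`, `match_prefix`) are hypothetical, token counts stand in for real key-value tensors, a single memory tier replaces the GPU/host hierarchy, and the recompute cost is a rough proxy that grows with the length of the preceding prefix. The priority follows the general Greedy Dual-Size Frequency form, clock + frequency × cost / size, and only leaf nodes are evicted so that every cached node keeps the prefix it depends on.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set, Tuple


@dataclass(eq=False)
class Node:
    """One document at one prefix position; `size` stands in for its KV tensors."""
    doc_id: str
    size: int                      # tokens held by this node (must be > 0)
    cost: float                    # assumed cost to recompute this node
    freq: int = 0
    priority: float = 0.0
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


class KnowledgeTree:
    """Hypothetical single-tier cache keyed by ordered document prefixes."""

    def __init__(self, capacity_tokens: int) -> None:
        self.root = Node("<root>", 0, 0.0)
        self.capacity = capacity_tokens
        self.used = 0
        self.clock = 0.0           # GDSF aging term, raised on each eviction

    def match_prefix(self, doc_ids: List[str]) -> List[Node]:
        """Return cached nodes along the longest matching document prefix."""
        node, path = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None:
                break
            path.append(child)
            node = child
        return path

    def _nodes(self) -> Iterator[Node]:
        stack = list(self.root.children.values())
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children.values())

    def _touch(self, node: Node) -> None:
        # GDSF-style priority: clock + frequency * cost / size.
        node.freq += 1
        node.priority = self.clock + node.freq * node.cost / node.size

    def _evict_one(self, protected: Set[int]) -> bool:
        # Only leaves are candidates, so no cached node loses its prefix.
        leaves = [n for n in self._nodes()
                  if not n.children and id(n) not in protected]
        if not leaves:
            return False
        victim = min(leaves, key=lambda n: n.priority)
        self.clock = max(self.clock, victim.priority)
        del victim.parent.children[victim.doc_id]
        self.used -= victim.size
        return True

    def access(self, docs: List[Tuple[str, int]]) -> int:
        """Serve a request with ordered (doc_id, tokens); return reused tokens."""
        path = self.match_prefix([doc_id for doc_id, _ in docs])
        for node in path:
            self._touch(node)
        hit_tokens = sum(node.size for node in path)
        protected = {id(node) for node in path}
        parent = path[-1] if path else self.root
        prefix_tokens = hit_tokens
        for doc_id, tokens in docs[len(path):]:
            while self.used + tokens > self.capacity:
                if not self._evict_one(protected):
                    return hit_tokens      # no room: leave the rest uncached
            # Assumed cost proxy: attention work grows with the prefix length.
            cost = float(tokens * (prefix_tokens + tokens))
            child = Node(doc_id, tokens, cost, parent=parent)
            parent.children[doc_id] = child
            self.used += tokens
            self._touch(child)
            protected.add(id(child))
            parent = child
            prefix_tokens += tokens
        return hit_tokens


tree = KnowledgeTree(capacity_tokens=300)
print(tree.access([("A", 100), ("B", 100)]))   # cold cache
print(tree.access([("A", 100), ("C", 100)]))   # shares the cached prefix A
print(tree.access([("B", 100), ("A", 100)]))   # same documents as the first request, different order
```

The third request retrieves the same documents as the first but in a different order and reuses nothing, which is the prefix sensitivity the tree is meant to respect: a document's key-value tensors depend on everything that precedes it in the prompt.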

Experimental Results

Research Questions

  • RQ1: What is the primary performance bottleneck in RAG systems, and can caching intermediate states of retrieved knowledge mitigate it?
  • RQ2: Can cross-request sharing of key-value tensors and a multilevel cache improve TTFT and throughput in RAG?
  • RQ3: How should a replacement policy and scheduling be designed to respect the prefix-sensitive nature of LLM token generation in RAG?
  • RQ4: Does speculative pipelining effectively overlap retrieval and inference without compromising system stability?
  • RQ5: What are the empirical gains when integrating RAGCache with existing LLM inference and vector databases?
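For the scheduling half of RQ3, one way to picture cache-aware reordering is a queue sorted by how much of each request is already cached relative to how much still has to be computed. The sketch below is a minimal illustration: the `Request` fields and the ratio used as the priority are assumptions made for this example, not a confirmed description of the system's scheduler.

```python
from typing import List, NamedTuple


class Request(NamedTuple):
    req_id: str
    cached_tokens: int     # tokens whose KV tensors are already in the cache
    compute_tokens: int    # tokens that still need prefill computation


def order_priority(req: Request) -> float:
    # Assumed priority: cached length divided by remaining computation length.
    return req.cached_tokens / max(req.compute_tokens, 1)


def reorder(queue: List[Request]) -> List[Request]:
    # sorted() is stable, so requests with equal priority keep arrival order.
    return sorted(queue, key=order_priority, reverse=True)


pending = [
    Request("r1", cached_tokens=0, compute_tokens=1000),
    Request("r2", cached_tokens=800, compute_tokens=200),
    Request("r3", cached_tokens=500, compute_tokens=500),
]
print([r.req_id for r in reorder(pending)])
```

Serving requests with a high cached-to-computed ratio first uses cached entries before they can be evicted. A real scheduler would also need to bound how long a low-priority request can be bypassed, which this sketch omits.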

Key Findings

  • RAG generation bottlenecks are dominated by the prefill phase, which scales with sequence length and document size.
  • Caching key-value tensors of retrieved documents significantly reduces prefill latency compared to full recomputation.
  • RAGCache achieves up to 4× reduction in time to first token (TTFT) and up to 2.1× throughput improvement versus vLLM with Faiss.
  • RAGCache outperforms SGLang by up to 3.5× TTFT reduction and up to 1.8× throughput increase.
  • The knowledge tree enables prefix-aware caching that aligns with document order and LLM state dependency.
  • Dynamic speculative pipelining further reduces end-to-end latency by overlapping retrieval with inference.
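The overlap of retrieval and inference in the last finding can be sketched as a small simulation. It assumes, for illustration only, that the vector search exposes a provisional top-k list after each stage and that a speculative prefill can be cancelled and restarted. The function name and the log messages are hypothetical, and timing, batching, and system load are ignored.

```python
from typing import List, Optional, Sequence, Tuple


def speculative_pipeline(
    stages: Sequence[Tuple[str, ...]],
) -> Tuple[List[str], bool]:
    """Simulate speculation over provisional retrieval results.

    `stages` holds the provisional top-k document ids after each search
    stage; the last entry is the final result. Returns the action log and
    whether the final result matched the prefill already in flight.
    """
    log: List[str] = []
    in_flight: Optional[Tuple[str, ...]] = None
    for i, candidates in enumerate(stages[:-1]):
        if candidates != in_flight:
            if in_flight is not None:
                log.append(f"stage {i}: cancel prefill for {in_flight}")
            log.append(f"stage {i}: start speculative prefill for {candidates}")
            in_flight = candidates
    hit = in_flight == stages[-1]
    log.append("final: reuse speculative prefill" if hit
               else "final: rerun prefill for final documents")
    return log, hit


log, hit = speculative_pipeline(
    [("d1", "d3"), ("d1", "d2"), ("d1", "d2"), ("d1", "d2")]
)
for line in log:
    print(line)
```

When the provisional result stabilizes before the search finishes, as in this run, the prefill for the final documents is already under way by the time retrieval returns. When it does not, the speculative work is discarded and the prefill is rerun, so the approach pays off only if early candidates often match the final ones.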


This review was generated by AI and checked by a human editor.