QUICK REVIEW

[Paper Review] ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management

Jing Zou, Shangyu Wu|arXiv (Cornell University)|2026. 01. 20.

Big Data and Digital Economy | Citations: 0

One-Line Summary

ContiguousKV is a granularity-aligned KV cache offloading system built around ContiguousChunk. It co-designs pruning with data management and accelerates the Re-Prefill phase through two-level asynchronous prefetching and attention-guided cache management.

ABSTRACT

Efficiently serving Large Language Models (LLMs) with persistent Prefix Key-Value (KV) Cache is critical for applications like conversational search and multi-turn dialogue. Serving a request requires loading the pre-computed prefix KV cache and generating the first token, defined as the Re-Prefill Phase. Offloading this shared prefix cache to secondary storage is essential for memory scalability. Re-Prefill with offloading suffers from severe I/O bottlenecks in two aspects. First, semantic-aware KV cache pruning algorithms select important tokens in fine granularity, while systems manage I/O in coarse, fixed-size blocks, causing severe read amplification. Second, the sequential dependency between identifying important tokens and loading KV cache creates idle I/O and compute bubbles, under-utilizing system resources. This paper proposes ContiguousKV, a high-performance prefix KV cache offloading system that bridges algorithmic semantics with I/O efficiency to accelerate the Re-Prefill phase. We first introduce ContiguousChunk, a unified data management granularity that aligns KV cache pruning with I/O operations. All the mechanisms critical for I/O performance are performed at the granularity of ContiguousChunk, thereby eliminating read amplification. By exploiting the high similarity in important ContiguousChunk indices across layers, we propose intra- and inter-period asynchronous prefetching to break the sequential dependency between I/O and compute, effectively eliminating idle bubbles. Finally, we propose attention-guided cache management to retain semantically critical prefix data in memory. Evaluations on Qwen2.5 series models show that ContiguousKV achieves a 3.85x speedup in the Re-Prefill phase over the state-of-the-art offloading system IMPRESS, while maintaining high output quality.
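
The read amplification the abstract describes can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the function names, the 64-token block size, and the 1024 bytes per token are all hypothetical. It assumes that every fixed-size block containing at least one selected token must be read in full.

```python
def bytes_read_fixed_blocks(important_tokens, block_size, bytes_per_token):
    """Bytes read when every block holding an important token is loaded whole.

    Assumption (not from the paper): storage is split into fixed blocks of
    `block_size` tokens, and a block cannot be read partially.
    """
    touched_blocks = {t // block_size for t in important_tokens}
    return len(touched_blocks) * block_size * bytes_per_token


def read_amplification(important_tokens, block_size, bytes_per_token):
    """Ratio of bytes actually read to bytes actually needed."""
    needed = len(set(important_tokens)) * bytes_per_token
    read = bytes_read_fixed_blocks(important_tokens, block_size, bytes_per_token)
    return read / needed


# Hypothetical case 1: four important tokens, each in a different block.
scattered = [0, 70, 130, 200]
print(read_amplification(scattered, block_size=64, bytes_per_token=1024))   # 64.0

# Hypothetical case 2: the selection is one aligned run of 64 tokens.
contiguous = list(range(64, 128))
print(read_amplification(contiguous, block_size=64, bytes_per_token=1024))  # 1.0
```

When the selected tokens are scattered, almost all of the bytes read are discarded. When the pruning unit and the I/O unit coincide, the ratio falls to 1, which is the effect the abstract attributes to managing everything at ContiguousChunk granularity.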

Motivation and Objectives

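Improve Re-Prefill efficiency by addressing read amplification and poor resource utilization in shared-prefix LLM serving.
Propose ContiguousChunk as a unified unit that aligns KV cache pruning, storage, and I/O.
Develop intra-period and inter-period asynchronous prefetching to pipeline I/O with computation.
Introduce attention-guided cache management that prioritizes semantically important prefix data.
Evaluate the performance gains on Qwen2.5 series models across multiple KV cache budgets.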

Proposed Method

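Define ContiguousChunk as a unit of contiguous tokens used for storage, eviction, and prefetching.
Develop a two-level prefetching engine, consisting of intra-period and inter-period prefetching, to pipeline I/O with computation.
Exploit the cross-layer and cross-period similarity of important ContiguousChunk indices to hide I/O latency.
Implement an attention-guided cache policy that uses the cache score S_j = I_j × F_j to prioritize ContiguousChunks in GPU/CPU memory.
Integrate the implementation into the FlexGen framework and compare it against IMPRESS and AttentionStore as baselines.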
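
To illustrate how a score of the form S_j = I_j × F_j could drive eviction, here is a minimal sketch. This review does not define I_j and F_j, so the code assumes I_j is an attention-derived importance value and F_j is an access count. The `ChunkCache` class and its methods are hypothetical and are not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkCache:
    """Toy fixed-capacity cache keyed by chunk id j (capacity must be >= 1)."""
    capacity: int
    importance: dict = field(default_factory=dict)  # I_j (assumed: attention-derived)
    frequency: dict = field(default_factory=dict)   # F_j (assumed: access count)

    def score(self, j):
        # S_j = I_j * F_j, the score form quoted in the review.
        return self.importance[j] * self.frequency[j]

    def access(self, j, importance):
        """Record an access to chunk j; return the evicted chunk id, or None."""
        if j in self.frequency:
            # Already resident: bump the count and refresh the importance.
            self.frequency[j] += 1
            self.importance[j] = importance
            return None
        evicted = None
        if len(self.frequency) >= self.capacity:
            # Evict the resident chunk with the lowest score.
            evicted = min(self.frequency, key=self.score)
            del self.frequency[evicted]
            del self.importance[evicted]
        self.importance[j] = importance
        self.frequency[j] = 1
        return evicted


cache = ChunkCache(capacity=2)
cache.access(0, importance=0.9)
cache.access(1, importance=0.1)
cache.access(0, importance=0.9)         # chunk 0 now has F_0 = 2, S_0 = 1.8
print(cache.access(2, importance=0.5))  # 1 (chunk 1 has the lowest score, 0.1)
```

Multiplying the two factors means a chunk stays resident only if it is both important and reused; a frequently touched but unimportant chunk, or an important chunk touched once, scores lower than one that is strong on both.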

Experimental Results

Research Questions

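RQ1: How does the granularity-aligned ContiguousChunk affect read amplification during Re-Prefill?
RQ2: Can intra-period and inter-period asynchronous prefetching reduce idle compute and I/O bubbles in the Re-Prefill phase?
RQ3: Does attention-guided cache management raise the hit rate for semantically important prefix data?
RQ4: What performance advantage does ContiguousKV offer over state-of-the-art offloading systems on Qwen2.5 models?
RQ5: How robust are the gains across different KV budget ratios?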

Key Results

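ContiguousKV achieves a 3.85x speedup over IMPRESS in the Re-Prefill phase.
ContiguousChunk aligns I/O with the pruning granularity, eliminating read amplification.
Intra-period and inter-period prefetching pipeline I/O with computation, reducing idle bubbles.
Attention-guided cache management improves caching of semantically important data.
Evaluations on Qwen2.5 models show that the speedup is achieved while maintaining output quality.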
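
The prefetching result rests on the observation that important ContiguousChunk indices are similar across layers. The sketch below models only that prediction step and performs no real I/O. It assumes, as a simplification that this review does not confirm, that the chunks important at one layer are used as the prefetch prediction for the next. `prefetch_plan` and the sample indices are hypothetical.

```python
def prefetch_plan(important_per_layer):
    """Split each layer's important chunks into prefetch hits and misses.

    Assumption: the previous layer's important chunk ids serve as the
    prediction for the current layer, so those chunks can be loaded
    while the previous layer is still computing.
    """
    plan = []
    for prev, cur in zip(important_per_layer, important_per_layer[1:]):
        predicted = set(prev)
        needed = set(cur)
        plan.append({
            "prefetched_hits": sorted(needed & predicted),  # I/O already overlapped
            "on_demand": sorted(needed - predicted),        # still a blocking load
            "wasted": sorted(predicted - needed),           # prefetched but unused
        })
    return plan


# Hypothetical important-chunk ids for three consecutive layers.
layers = [[1, 4, 7], [1, 4, 8], [1, 5, 8]]
for step in prefetch_plan(layers):
    print(step)
# {'prefetched_hits': [1, 4], 'on_demand': [8], 'wasted': [7]}
# {'prefetched_hits': [1, 8], 'on_demand': [5], 'wasted': [4]}
```

The higher the overlap between consecutive layers, the larger the share of loads that can be issued before the selection for the next layer is known, which is what removes the dependency between identifying important chunks and loading them.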

This review was generated by AI and reviewed by a human editor.