QUICK REVIEW

[Paper Review] SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Sanjay Kariyappa, G. Edward Suh | arXiv (Cornell University) | 2026-02-26

Semantic Web and Ontologies | Citations: 0

One-Line Summary

SideQuest enables a model-driven auxiliary memory management thread that evicts stale KV cache entries during long-horizon agentic reasoning, reducing peak memory usage with minimal loss in accuracy.

ABSTRACT

Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest -- a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.

Motivation and Objectives

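Motivates the work with the memory bottleneck that growing KV caches create in long-horizon agentic reasoning.
Proposes a model-driven KV cache eviction mechanism that runs as an auxiliary task in parallel with the main reasoning.
Demonstrates that semantic, self-directed eviction outperforms fixed heuristics on dynamic multi-step tasks.
Shows that parallel auxiliary reasoning substantially reduces peak memory and the number of memory reads while preserving accuracy.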

Proposed Method

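Introduces SideQuest, which runs an auxiliary memory management thread in parallel with the main ReAct reasoning process.
Uses the LRM itself to reason about staleness and to generate deletion commands (such as del_cursors) for KV cache entries.
Frames memory management as an auxiliary task with its own trigger phrase, "Memory management mode", and trains it on hindsight-annotated data.
Generates training data as two kinds of traces: main traces that preserve the core reasoning (logit distillation) and auxiliary traces that teach eviction (cross-entropy).
Trains with a joint objective that combines the distillation loss on main traces with the cross-entropy loss on auxiliary traces, enabling trigger-based auxiliary behavior.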
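
The effect of cursor-based eviction on peak cache size can be illustrated with a toy replay. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Segment` and `ToyKVCache` classes, the token counts, and the hard-coded eviction schedule are all hypothetical, and only the name `del_cursors` comes from the review. In SideQuest the model itself decides which cursors are stale; here that decision is simply supplied as input. The eviction commands are never appended to the cache, mirroring the idea that the auxiliary task should not pollute the main context.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One span of context whose KV entries can be evicted as a unit."""
    cursor: int      # hypothetical identifier used to refer to the span
    kind: str        # e.g. "thought" or "tool_output"
    num_tokens: int  # tokens this span occupies in the KV cache


@dataclass
class ToyKVCache:
    """Tracks live spans and the largest number of tokens ever held."""
    segments: dict = field(default_factory=dict)
    peak_tokens: int = 0

    def live_tokens(self) -> int:
        return sum(seg.num_tokens for seg in self.segments.values())

    def append(self, seg: Segment) -> None:
        self.segments[seg.cursor] = seg
        self.peak_tokens = max(self.peak_tokens, self.live_tokens())

    def del_cursors(self, cursors) -> int:
        """Evict the named spans and return the number of tokens freed."""
        freed = 0
        for cursor in cursors:
            seg = self.segments.pop(cursor, None)
            if seg is not None:
                freed += seg.num_tokens
        return freed


def run_episode(steps, evictions=None):
    """Replay steps; `evictions` maps a step index to cursors dropped after it."""
    cache = ToyKVCache()
    for i, seg in enumerate(steps):
        cache.append(seg)
        if evictions and i in evictions:
            cache.del_cursors(evictions[i])
    return cache


# Hypothetical episode: two large tool outputs interleaved with short thoughts.
steps = [
    Segment(0, "thought", 200),
    Segment(1, "tool_output", 4000),
    Segment(2, "thought", 300),
    Segment(3, "tool_output", 5000),
    Segment(4, "thought", 250),
]

baseline = run_episode(steps)                     # nothing is ever evicted
managed = run_episode(steps, evictions={2: [1]})  # drop tool output 1 after step 2

print(baseline.peak_tokens, managed.peak_tokens)
```

Because the first tool output is evicted before the second one arrives, the two large spans never coexist in the managed run, which is what lowers the peak rather than just the final size.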

Experimental Results

Research Questions

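RQ1: Can a model-driven auxiliary process effectively identify and evict stale tool outputs in multi-step agentic tasks?
RQ2: Does parallel auxiliary reasoning reduce peak KV cache usage and memory reads without significantly affecting accuracy?
RQ3: How does SideQuest compare with heuristic KV cache eviction methods on dynamic, long-context research-style workloads?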

Key Results

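SideQuest reduces peak token usage by 56-65% compared with the uncompressed baseline.
SideQuest reduces KV cache memory reads by 53-71% relative to the baseline.
Accuracy degradation is small, at most 2% on FRAMES and 5% on BrowseComp, and SideQuest outperforms the heuristic baselines.
In a serving benchmark, SideQuest increases system throughput by 83.9% and reduces total runtime by 36.8%.
SideQuest keeps the non-completion rate close to zero, whereas many heuristic baselines lead to higher failure rates.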
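
Percentage reductions and percentage increases are easy to misread side by side, so the short sketch below converts the reported figures into ratios against the baseline. The helper names are hypothetical and the code only restates the numbers quoted above; it does not derive any new result or imply a relationship between the throughput and runtime figures.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline quantity left after a percentage reduction."""
    return 1.0 - reduction_pct / 100.0


def shrink_factor(reduction_pct: float) -> float:
    """How many times smaller the quantity is than the baseline."""
    return 1.0 / remaining_fraction(reduction_pct)


def growth_factor(increase_pct: float) -> float:
    """Ratio to the baseline after a percentage increase."""
    return 1.0 + increase_pct / 100.0


# Figures as reported in the review.
peak_low, peak_high = shrink_factor(56.0), shrink_factor(65.0)
runtime_left = remaining_fraction(36.8)
throughput_ratio = growth_factor(83.9)

print(f"peak tokens: {peak_low:.2f}x to {peak_high:.2f}x smaller")
print(f"runtime: {runtime_left:.3f} of baseline")
print(f"throughput: {throughput_ratio:.3f}x baseline")
```

Read this way, a 56-65% cut in peak tokens means the managed run holds well under half of what the uncompressed baseline holds at its peak.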


This review was generated by AI and reviewed by a human editor.