QUICK REVIEW

[Paper Review] CacheMind: From Miss Rates to Why -- Natural-Language, Trace-Grounded Reasoning for Cache Replacement

Kaushal Mhapsekar, Azam Ghanbari|arXiv (Cornell University)|Feb 12, 2026

Parallel Computing and Optimization Techniques | Cited by 0

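ONE-SENTENCE SUMMARY

CacheMind is a conversational, retrieval-augmented system that grounds cache replacement analysis in per-event trace slices, answers natural-language queries over them, and is validated with a new benchmark, CacheMindBench.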

ABSTRACT

Cache replacement remains a challenging problem in CPU microarchitecture, often addressed using hand-crafted heuristics, limiting cache performance. Cache data analysis requires parsing millions of trace entries with manual filtering, making the process slow and non-interactive. To address this, we introduce CacheMind, a conversational tool that uses Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to enable semantic reasoning over cache traces. Architects can now ask natural language questions like, "Why is the memory access associated with PC X causing more evictions?", and receive trace-grounded, human-readable answers linked to program semantics for the first time. To evaluate CacheMind, we present CacheMindBench, the first verified benchmark suite for LLM-based reasoning for the cache replacement problem. Using the SIEVE retriever, CacheMind achieves 66.67% on 75 unseen trace-grounded questions and 84.80% on 25 unseen policy-specific reasoning tasks; with RANGER, it achieves 89.33% and 64.80% on the same evaluations. Additionally, with RANGER, CacheMind achieves 100% accuracy on 4 out of 6 categories in the trace-grounded tier of CacheMindBench. Compared to LlamaIndex (10% retrieval success), SIEVE achieves 60% and RANGER achieves 90%, demonstrating that existing Retrieval-Augmented Generation (RAGs) are insufficient for precise, trace-grounded microarchitectural reasoning. We provided four concrete actionable insights derived using CacheMind, wherein bypassing use case improved cache hit rate by 7.66% and speedup by 2.04%, software fix use case gives speedup of 76%, and Mockingjay replacement policy use case gives speedup of 0.7%; showing the utility of CacheMind on non-trivial queries that require a natural-language interface.

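MOTIVATION AND GOALS

Move cache replacement analysis beyond fixed miss-rate metrics toward interactive, explainable analysis.
Enable per-PC and per-address semantic queries over millions of trace events.
Provide a verified benchmark suite for evaluating LLM reasoning in a microarchitectural setting.
Demonstrate retrieval-augmented reasoning that yields trace-grounded explanations of how replacement policies and workloads interact.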

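PROPOSED METHOD

Introduce CacheMind, which pairs two retrievers (Sieve and Ranger) with a generative LLM to produce trace-grounded explanations.
Sieve extracts task-specific trace slices from ChampSim traces through symbolic-semantic filtering.
Ranger translates natural-language queries into executable retrieval code that runs against an external trace database.
Retrieval-Augmented Generation (RAG) ties the LLM's output to the retrieved trace evidence.
Develop CacheMindBench, a 100-question benchmark covering factual, comparative, arithmetic, and semantic reasoning, with answers evaluated against the traces.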
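
The retrieval idea above can be illustrated with a minimal Python sketch. It is not CacheMind's implementation: the `TraceEvent` record, its field names, and the helper functions are hypothetical, and the events are made up. The sketch only shows what "filter the trace down to a task-specific slice, then compute evidence for a question such as which PC causes the most evictions" might look like in code of the kind a retriever could emit.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical per-access record; not the real ChampSim/CacheMind schema."""
    pc: int                 # program counter of the memory instruction
    address: int            # accessed block address
    hit: bool               # True if the access hit in the cache
    evicted: Optional[int]  # block address evicted by this access, if any
    policy: str             # replacement policy active for this run


def trace_slice(events, pc=None, policy=None):
    """Return only the events matching the given filters (a task-specific slice)."""
    return [
        e for e in events
        if (pc is None or e.pc == pc) and (policy is None or e.policy == policy)
    ]


def evictions_per_pc(events):
    """Count how many evictions the accesses of each PC triggered."""
    return Counter(e.pc for e in events if e.evicted is not None)


def miss_rate(events):
    """Fraction of accesses that missed; 0.0 for an empty slice."""
    return sum(not e.hit for e in events) / len(events) if events else 0.0


# Made-up events for illustration only.
events = [
    TraceEvent(0x401A10, 0x1000, False, None, "LRU"),
    TraceEvent(0x401A10, 0x2000, False, 0x1000, "LRU"),
    TraceEvent(0x401B20, 0x3000, True, None, "LRU"),
    TraceEvent(0x401A10, 0x1000, False, 0x2000, "LRU"),
    TraceEvent(0x401B20, 0x3000, True, None, "LRU"),
]

# "Which PC causes the most evictions, and how often does it miss?"
hot_pc, n_evictions = evictions_per_pc(events).most_common(1)[0]
print(hex(hot_pc), n_evictions, miss_rate(trace_slice(events, pc=hot_pc)))
```

In a RAG setting, numbers like these would be handed to the LLM as evidence, so that its explanation stays tied to the retrieved trace slice rather than to the model's prior knowledge.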

Figure 1. The method filters raw traces to a task-specific slice and returns the most informative evidence for the user’s query. Old ChampSim could tell you a miss; CacheMind shows which PC missed on which data, under which policy, and why, for every event, acting as a microarchitectural microscope.

EXPERIMENTAL RESULTS

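Research Questions

RQ1: Can a conversational, trace-grounded system answer per-event, per-PC cache questions with verifiable evidence?
RQ2: How do symbolic-semantic retrieval and LLM-based retrieval differ in precision and flexibility for cache analysis?
RQ3: How does grounding LLM reasoning in trace data affect accuracy and trustworthiness?
RQ4: What actionable insights for bypass prediction, software fixes, and policy design can trace-grounded reasoning provide?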

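Key Findings

With the Sieve retriever, CacheMind reaches 66.67% accuracy on 75 unseen trace-grounded questions and 84.80% accuracy on 25 unseen policy-specific reasoning tasks.
With Ranger, it reaches 89.33% and 64.80% on the same evaluations, and 100% accuracy in 4 of the 6 trace-grounded categories.
On retrieval success, LlamaIndex reaches 10%, while Sieve reaches 60% and Ranger reaches 90%, up to nine times the LlamaIndex rate.
The use cases yield actionable insights: bypassing improves cache hit rate by 7.66% and gives a 2.04% speedup, a software fix gives a 76% speedup, and the Mockingjay RDP use case gives a 0.7% speedup.
CacheMind suggests that reasoning-capable, trace-grounded analysis can go beyond conventional fixed-metric reporting and be more effective for evaluating cache policies.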
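
As a quick sanity check on the reported figures, the short Python sketch below converts the percentages into implied scores and retrieval ratios. The helper name `implied_score` and the dictionary layout are illustrative; only the percentages and question counts come from the abstract. Note that the policy-specific tier gives fractional totals (21.2 and 16.2 out of 25). One possible reading is that those tasks are scored with partial credit or averaged over runs, but that is an inference, not something stated in this summary.

```python
def implied_score(accuracy_pct, n_questions):
    """Score implied by a reported accuracy, rounded to two decimals."""
    return round(accuracy_pct / 100 * n_questions, 2)


# (accuracy in %, number of questions), as reported in the abstract.
reported = {
    ("SIEVE", "trace-grounded"): (66.67, 75),
    ("SIEVE", "policy-specific"): (84.80, 25),
    ("RANGER", "trace-grounded"): (89.33, 75),
    ("RANGER", "policy-specific"): (64.80, 25),
}
implied = {key: implied_score(*value) for key, value in reported.items()}

# Retrieval success rates in %, as reported in the abstract.
retrieval_success = {"LlamaIndex": 10, "SIEVE": 60, "RANGER": 90}
ratio_vs_llamaindex = {
    name: pct / retrieval_success["LlamaIndex"]
    for name, pct in retrieval_success.items()
}

for key, score in implied.items():
    print(key, score)
print(ratio_vs_llamaindex)
```

The trace-grounded tier works out to whole numbers of questions (50 and 67 of 75), and the retrieval ratios are 6x for Sieve and 9x for Ranger relative to LlamaIndex.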

This review was generated by AI and reviewed by a human editor.