QUICK REVIEW

[Paper Review] Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems

Harmohit Singh|arXiv (Cornell University)|Jan 16, 2026

Natural Language Processing Techniques | Cited by: 0

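One-Sentence Summary

The paper presents a production-optimized multi-agent NL2Code system that uses semantic caching, a dual-threshold cache decision, and intent-driven prompting to translate natural language queries into executable Python, achieving 94.3% semantic accuracy and an average latency of 8.2 s across more than 10,000 queries.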

ABSTRACT

We present a production-optimized multi-agent system designed to translate natural language queries into executable Python code for structured data analytics. Unlike systems that rely on expensive frontier models, our approach achieves high accuracy and cost efficiency through three key innovations: (1) a semantic caching system with LLM-based equivalence detection and structured adaptation hints that provides cache hit rates of 67% on production queries; (2) a dual-threshold decision mechanism that separates exact-match retrieval from reference-guided generation; and (3) an intent-driven dynamic prompt assembly system that reduces token consumption by 40-60% through table-aware context filtering. The system has been deployed in production for enterprise inventory management, processing over 10,000 queries with an average latency of 8.2 seconds and 94.3% semantic accuracy. We describe the architecture, present empirical results from production deployment, and discuss practical considerations for deploying LLM-based analytics systems at scale.

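Research Motivation and Objectives

Address the cost, latency, and domain-accuracy challenges of enterprise NL2Code deployments.
Introduce a semantic cache with LLM-based equivalence detection and structured adaptation hints to raise cache utilization.
Develop a dual-threshold decision mechanism that separates exact-match retrieval from reference-guided generation.
Build an intent-driven dynamic prompt assembly system that lowers token usage while preserving accuracy.
Report quantified latency and accuracy metrics from a production deployment of more than 10,000 queries.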

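Proposed Method

Propose a multi-agent architecture orchestrated by LangGraph, with agents including Guard, Intent Classifier, Reference Matcher, Planner, Python, Executor, and Business Insights Generator.
Define a five-level hierarchical QuerySignature that captures structured intent for robust cache matching.
Implement dual-threshold caching: exact-match return (s ≥ 0.995) versus guide mode (0.50 ≤ s < 0.995), which enables reference-guided generation.
Apply LLM-based semantic equivalence detection to the top-k cache candidates, producing structured adaptation hints that guide the Planner.
Use intent-driven dynamic prompt assembly, which filters the prompt by identifying the relevant tables and domain terms, to cut token counts by 40-60%.
Evaluate on production deployment data, with metrics covering semantic accuracy, cache return and guide rates, latency, tokens, and cost.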
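
The dual-threshold routing and the table-aware prompt filtering described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the two threshold values (0.995 and 0.50) come from the summary. The names `CacheDecision`, `decide`, `PromptSection`, and `assemble_prompt` are hypothetical, the similarity score `s` is taken as an input because its computation is not specified here, and tagging prompt sections with table names is an assumed way of doing table-aware filtering.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Set

RETURN_THRESHOLD = 0.995  # from the summary: s >= 0.995 returns the cached result
GUIDE_THRESHOLD = 0.50    # from the summary: 0.50 <= s < 0.995 triggers guide mode


class CacheDecision(Enum):
    RETURN = "return"      # reuse the cached code as-is
    GUIDE = "guide"        # hand the cached entry to the planner as a reference
    GENERATE = "generate"  # no usable reference: generate from scratch


def decide(similarity: float) -> CacheDecision:
    """Map a similarity score s in [0, 1] to one of the three paths."""
    if similarity >= RETURN_THRESHOLD:
        return CacheDecision.RETURN
    if similarity >= GUIDE_THRESHOLD:
        return CacheDecision.GUIDE
    return CacheDecision.GENERATE


@dataclass(frozen=True)
class PromptSection:
    """Hypothetical prompt fragment tagged with the tables it documents."""
    text: str
    tables: FrozenSet[str]  # empty set means "always include" (general rules)


def assemble_prompt(sections: List[PromptSection], needed_tables: Set[str]) -> str:
    """Keep general sections plus those documenting a table the query needs."""
    kept = [
        section.text
        for section in sections
        if not section.tables or section.tables & needed_tables
    ]
    return "\n\n".join(kept)


if __name__ == "__main__":
    sections = [
        PromptSection("General coding rules.", frozenset()),
        PromptSection("Schema notes for inventory.", frozenset({"inventory"})),
        PromptSection("Schema notes for orders.", frozenset({"orders"})),
    ]
    print(decide(0.97))                             # guide mode
    print(assemble_prompt(sections, {"inventory"}))  # omits the orders notes
```

Checking the higher threshold first keeps the two bands disjoint, so a score of exactly 0.995 is returned from cache rather than treated as a guide.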

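Experimental Results

Research Questions

RQ1: How effective is semantic caching with LLM-based equivalence detection for enterprise NL2Code workloads?
RQ2: Can a dual-threshold caching strategy balance accuracy and cost in a production NL2Code system?
RQ3: Does intent-driven prompt assembly significantly reduce token usage without sacrificing accuracy?
RQ4: What are the production performance characteristics (latency, accuracy, cache utilization) on real enterprise queries?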

Key Findings

Metric	Value
Semantic Accuracy	94.3%
Cache Return Rate	23.1%
Cache Guide Rate	44.2%
Total Cache Utilization	67.3%
Average Latency (all queries)	8.2s
Average Tokens per Query	32,450
Average Cost per Query	$0.0089

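Semantic accuracy on production queries reaches 94.3%.
Total cache utilization reaches 67.3%.
The cache return rate (s ≥ 0.995) is 23.1%.
The cache guide rate (0.50 ≤ s < 0.995) is 44.2%.
Average latency across all queries is 8.2 s; cache returns take 2.1 s and fresh generation takes 16.4 s.
Each query uses 32,450 tokens on average, at an average cost of $0.0089.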
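
As a quick consistency check on the figures above, the snippet below recomputes total cache utilization from the return and guide rates and scales the per-query averages to a round 10,000 queries. The per-query numbers are copied from the table. The 10,000-query volume is an illustrative assumption (the source says only "over 10,000"), and the projected totals are back-of-envelope values that the paper does not report.

```python
# Figures copied from the findings table.
CACHE_RETURN_RATE = 0.231
CACHE_GUIDE_RATE = 0.442
AVG_TOKENS_PER_QUERY = 32_450
AVG_COST_PER_QUERY_USD = 0.0089

# ASSUMPTION: a round query volume chosen for illustration only.
ASSUMED_QUERY_COUNT = 10_000

# Total utilization is the share of queries served by either cache path.
total_cache_utilization = CACHE_RETURN_RATE + CACHE_GUIDE_RATE

# Queries that use neither cache path and are generated from scratch.
fresh_generation_rate = 1.0 - total_cache_utilization

# Back-of-envelope totals under the assumed volume.
projected_tokens = AVG_TOKENS_PER_QUERY * ASSUMED_QUERY_COUNT
projected_cost_usd = AVG_COST_PER_QUERY_USD * ASSUMED_QUERY_COUNT

if __name__ == "__main__":
    print(f"total cache utilization: {total_cache_utilization:.1%}")
    print(f"fresh generation share:  {fresh_generation_rate:.1%}")
    print(f"projected tokens:        {projected_tokens:,}")
    print(f"projected cost (USD):    {projected_cost_usd:,.2f}")
```

The two rates sum to the reported 67.3% total utilization, which leaves roughly a third of queries on the fresh-generation path.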


This review was generated by AI and checked by a human editor.