QUICK REVIEW

[Paper Review] Semantic Caching and Intent-Driven Context Optimization for Multi-Agent Natural Language to Code Systems

Harmohit Singh|arXiv (Cornell University)|Jan 16, 2026

Natural Language Processing Techniques|0 citations

TL;DR

The paper presents a production-optimized multi-agent NL2Code system that uses semantic caching, a dual-threshold cache decision, and intent-driven prompt assembly to convert natural language queries into executable Python, achieving 94.3% semantic accuracy at an average latency of 8.2 seconds across more than 10,000 production queries.

ABSTRACT

We present a production-optimized multi-agent system designed to translate natural language queries into executable Python code for structured data analytics. Unlike systems that rely on expensive frontier models, our approach achieves high accuracy and cost efficiency through three key innovations: (1) a semantic caching system with LLM-based equivalence detection and structured adaptation hints that provides cache hit rates of 67% on production queries; (2) a dual-threshold decision mechanism that separates exact-match retrieval from reference-guided generation; and (3) an intent-driven dynamic prompt assembly system that reduces token consumption by 40-60% through table-aware context filtering. The system has been deployed in production for enterprise inventory management, processing over 10,000 queries with an average latency of 8.2 seconds and 94.3% semantic accuracy. We describe the architecture, present empirical results from production deployment, and discuss practical considerations for deploying LLM-based analytics systems at scale.

Motivation & Objective

Address the challenges of cost, latency, and domain precision in enterprise NL2Code deployments.
Introduce a semantic cache with LLM-based equivalence detection and structured adaptation hints to improve cache utilization.
Develop a dual-threshold decision mechanism to separate exact-match retrieval from reference-guided generation.
Implement an intent-driven dynamic prompt assembly system to reduce token usage while preserving accuracy.
Demonstrate production deployment results across 10,000+ queries with quantified latency and accuracy metrics.

Proposed method

Propose a multi-agent architecture orchestrated by LangGraph with Guard, Intent Classifier, Reference Matcher, Planner, Python, Executor, and Business Insights Generator agents.
Define a QuerySignature with five hierarchical levels to capture structural intent for robust cache matching.
Implement a dual-threshold cache: exact-match returns (s ≥ 0.995) and guide mode (0.50 ≤ s < 0.995) for reference-guided generation.
Use LLM-based semantic equivalence detection on the top-k cached candidates to produce structured adaptation hints that guide the Planner.
Employ intent-driven dynamic prompt assembly that filters prompts by identified tables and domain terms to reduce token counts by 40-60%.
Evaluate across production deployment data using metrics: semantic accuracy, cache hit/guide rates, latency, tokens, and cost.
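The dual-threshold decision and the table-aware prompt filtering described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: only the two thresholds (0.995 and 0.50) come from the review, while the names `cache_decision`, `PromptSection`, and `assemble_prompt`, the mode labels, and the example table names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds as reported in the review; everything else is illustrative.
RETURN_THRESHOLD = 0.995
GUIDE_THRESHOLD = 0.50


def cache_decision(similarity: float) -> str:
    """Map a similarity score s to one of three hypothetical modes."""
    if similarity >= RETURN_THRESHOLD:
        return "return"    # reuse the cached code as-is
    if similarity >= GUIDE_THRESHOLD:
        return "guide"     # hand the cached entry to the planner as a reference
    return "generate"      # no usable match: generate from scratch


@dataclass
class PromptSection:
    text: str
    tables: frozenset  # tables this section documents; empty means always include


def assemble_prompt(sections, identified_tables):
    """Keep general sections plus those tied to an identified table."""
    wanted = set(identified_tables)
    kept = [s.text for s in sections if not s.tables or s.tables & wanted]
    return "\n\n".join(kept)


# Hypothetical prompt sections for an inventory-style schema.
sections = [
    PromptSection("General rules: return a pandas DataFrame.", frozenset()),
    PromptSection("inventory: sku, warehouse_id, quantity", frozenset({"inventory"})),
    PromptSection("suppliers: supplier_id, name, lead_time_days", frozenset({"suppliers"})),
]
prompt = assemble_prompt(sections, ["inventory"])
print(cache_decision(0.72))
print(prompt)
```

In this sketch, a query whose intent touches only the hypothetical `inventory` table drops the `suppliers` section from the prompt, which is the kind of filtering that would account for the reported token reduction.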

Experimental results

Research questions

RQ1: How effective is semantic caching with LLM-based equivalence detection for enterprise NL2Code workloads?
RQ2: Can a dual-threshold cache strategy balance accuracy and cost in production NL2Code systems?
RQ3: Does intent-driven prompt assembly significantly reduce token usage without sacrificing accuracy?
RQ4: What are the production performance characteristics (latency, accuracy, cache utilization) of the proposed system across real enterprise queries?

Key findings

Semantic accuracy reached 94.3% on production queries.
Total cache utilization reached 67.3%.
Cache Return Rate (s ≥ 0.995) was 23.1%.
Cache Guide Rate (0.50 ≤ s < 0.995) was 44.2%.
Average latency across all queries was 8.2 seconds; cache returns averaged 2.1 seconds and fresh generation 16.4 seconds.
Average tokens per query were 32,450 and average cost per query was $0.0089.
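As a quick consistency check on the figures above, the arithmetic can be worked through in a few lines. The sum of the two cache rates and the cost per million tokens follow directly from the reported numbers. The guide-mode latency is not reported in this review; the value computed here is only what the other figures would imply under two assumptions: that the 8.2-second average is a share-weighted mean over the three modes, and that every query outside the cache is a fresh generation.

```python
# Reported figures from the key findings.
return_rate = 0.231
guide_rate = 0.442
avg_latency_s = 8.2
return_latency_s = 2.1
fresh_latency_s = 16.4
avg_tokens = 32_450
avg_cost_usd = 0.0089

# Total cache utilization is the sum of the return and guide rates.
total_utilization = return_rate + guide_rate

# Assumption: all remaining queries are fresh generations.
fresh_rate = 1.0 - total_utilization

# Assumption: the overall average is a share-weighted mean of the three modes.
implied_guide_latency_s = (
    avg_latency_s
    - return_rate * return_latency_s
    - fresh_rate * fresh_latency_s
) / guide_rate

# Average cost expressed per million tokens.
cost_per_million_tokens = avg_cost_usd / avg_tokens * 1_000_000

print(f"total cache utilization: {total_utilization:.1%}")
print(f"implied guide-mode latency: {implied_guide_latency_s:.1f} s")
print(f"cost per million tokens: ${cost_per_million_tokens:.3f}")
```

The first line reproduces the reported 67.3% utilization. The implied guide-mode latency should be read as a back-of-the-envelope estimate under the stated assumptions, not as a result from the paper.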

This review was created by AI and reviewed by human editors.