QUICK REVIEW

[Paper Review] Agentic SPARQL: Evaluating SPARQL-MCP-powered Intelligent Agents on the Federated KGQA Benchmark

Daniel Dobriy, Frederik Bauer | arXiv (Cornell University) | Jan 20, 2026

Semantic Web and Ontologies | Cited by 0

One-sentence summary

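This paper introduces SPARQL-MCP for agentic federated knowledge graph question answering, builds a federated KGQA (FKGQA) benchmark, and evaluates LLM agents (GPT-5.2 and Qwen3-8B) under three settings; GPT-5.2 reaches 42.1%–45.4% accuracy, and the results highlight how much high-level endpoint descriptions affect performance.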

ABSTRACT

Standard protocols such as the Model Context Protocol (MCP) that allow LLMs to connect to tools have recently boosted "agentic" AI applications, which, powered by LLMs' planning capabilities, promise to solve complex tasks with the access of external tools and data sources. In this context, publicly available SPARQL endpoints offer a natural connection to combine various data sources through MCP by (a) implementing a standardised protocol and query language, (b) standardised metadata formats, and (c) the native capability to federate queries. In the present paper, we explore the potential of SPARQL-MCP-based intelligent agents to facilitate federated SPARQL querying: firstly, we discuss how to extend an existing Knowledge Graph Question Answering benchmark towards agentic federated Knowledge Graph Question Answering (FKGQA); secondly, we implement and evaluate the ability of integrating SPARQL federation with LLM agents via MCP (incl. endpoint discovery/source selection, schema exploration, and query formulation), comparing different architectural options against the extended benchmark. Our work complements and extends prior work on automated SPARQL query federation towards fruitful combinations with agentic AI.

Motivation and objectives

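Extend a KGQA benchmark to agentic federated KGQA (FKGQA).
Develop a SPARQL-MCP server that supports endpoint discovery, schema exploration, and federation.
Evaluate agentic SPARQL with frontier LLMs across several architectural settings.
Analyse model behaviour, endpoint discovery patterns, and query efficiency in federated scenarios.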

Proposed method

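Proposes a SPARQL-MCP extension for federated querying that supports dynamic endpoint exploration and VoID metadata processing.
Integrates a proxy federation engine to manage multi-endpoint SERVICE calls and to cope with endpoint blocking.
Extends Spider4SPARQL into a federated KGQA benchmark using vertical, class-based, and horizontal shard partitioning.
Uses a ReAct-style agent with an MCP toolchain to evaluate three agentic settings (baseline, high-level endpoint descriptions, and void_tool VoID retrieval).
Measures syntactic validity, pipeline accuracy, endpoint accuracy, and behavioural patterns on GPT-5.2 and Qwen3-8B.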
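
To make the SERVICE-based federation mentioned above more concrete, here is a minimal sketch of how a federated query over several shards could be assembled. It is not the paper's implementation: the helper `build_federated_query`, the `example.org` endpoint URLs, and the `Singer` class IRI are all hypothetical and used only for illustration. The only thing taken from the standard is the SPARQL 1.1 `SERVICE` keyword.

```python
from typing import List


def build_federated_query(endpoints: List[str], class_iri: str, limit: int = 10) -> str:
    """Return a SPARQL 1.1 query with one SERVICE block per endpoint.

    Hypothetical helper for illustration only: each block asks one
    endpoint (shard) for instances of class_iri, and the blocks are
    combined with UNION so results from all shards are merged.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    service_blocks = [
        f"  {{ SERVICE <{url}> {{ ?s a <{class_iri}> . }} }}"
        for url in endpoints
    ]
    body = "\n  UNION\n".join(service_blocks)
    return f"SELECT DISTINCT ?s WHERE {{\n{body}\n}}\nLIMIT {limit}"


if __name__ == "__main__":
    # Assumed shard endpoints; a real agent would discover these first.
    shards = [
        "http://example.org/shard1/sparql",
        "http://example.org/shard2/sparql",
    ]
    print(build_federated_query(shards, "http://example.org/Singer", limit=5))
```

In an agentic setup, the list of endpoints would come from a discovery step (for example from VoID descriptions) rather than being hard-coded, and the query pattern inside each block would depend on the question being answered.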

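Experimental results

Research questions

RQ1: Can agentic SPARQL agents automatically discover endpoints, explore schemas, and formulate federated SPARQL queries from natural-language questions?
RQ2: How do endpoint discovery and schema exploration strategies affect the accuracy and efficiency of federated KGQA?
RQ3: How does performance differ between a high-capacity model (GPT-5.2) and a smaller model (Qwen3-8B) on agentic SPARQL tasks?
RQ4: Does providing high-level endpoint descriptions improve source selection and reduce unnecessary federation?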

Key findings

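GPT-5.2 reaches 42.1% accuracy in the baseline setting, 45.4% with high-level descriptions, and 43.5% with void_tool, remaining comparable to existing approaches despite the added federation complexity on Spider4SPARQL.
Qwen3-8B reaches 13.1% in the baseline, 13.2% with high-level descriptions, and 13.8% with void_tool, well below GPT-5.2.
The syntactic success rate over all runs is 75.7% (29,431 of 38,886); GPT-5.2 ranges from 97.4% to 98.0% and Qwen3-8B from 41.5% to 61.1%.
GPT-5.2 has a high endpoint consultation rate in the baseline (90.7%) and void_tool (91.7%) settings but a much lower one in the high-level setting (25.8%); Qwen3-8B reaches a 98.6% endpoint success rate with void_tool.
Most GPT-5.2 queries in the baseline are minimal federations (90.2%–91.7%), dropping to 11.0% in the high-level setting; Qwen3-8B remains highly minimal (68.5%–98.6%).
Realised federations touch 4.84 shards on average, 24.49% of queries match exactly one shard, and the average dispersion across datasets is 6.48 shards (minimum 2, maximum 14).
VoID retrieval calls average roughly 1.0–1.1 per run in all three settings, and median end-to-end runtime is about 16.3–31.9 seconds depending on the model.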
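
As a quick sanity check on the figures above, the following sketch recomputes the overall syntactic success rate from the reported counts and the accuracy gain of the high-level setting over the baseline for GPT-5.2. The helper name `percentage` is an assumption made for this example; the numbers are the ones quoted in the findings.

```python
def percentage(successes: int, total: int) -> float:
    """Return successes / total as a percentage rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * successes / total, 1)


# Counts reported in the findings: 29,431 syntactically valid of 38,886 runs.
overall_syntactic = percentage(29_431, 38_886)

# GPT-5.2 accuracy: baseline 42.1%, high-level descriptions 45.4%.
gain_points = round(45.4 - 42.1, 1)

if __name__ == "__main__":
    print(overall_syntactic)  # 75.7
    print(gain_points)        # 3.3 percentage points
```

The recomputed 75.7% matches the reported rate, and the gain from high-level endpoint descriptions comes to 3.3 percentage points for GPT-5.2.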


This review was generated by AI and checked by a human editor.