QUICK REVIEW

[Paper Review] From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

Shuoling Liu, Zhiquan Tan|arXiv (Cornell University)|Mar 26, 2026

Machine Learning in Materials Science | Cited by 0

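One-Sentence Summary

The paper formalizes deep research agents (DRAs) through category theory and introduces a 296-question benchmark that stress-tests their structure-preserving abilities along four axes, revealing marked limitations in multi-hop structural synthesis.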

ABSTRACT

Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification, matching pure reasoning models in falsifying hallucinated premises, they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at https://github.com/tzq1999/CDR.

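Research Motivation and Goals

Motivate the need for rigorous, theoretically grounded evaluation of DRAs that goes beyond ad hoc benchmarks.
Introduce a category-theoretic formalization of DRA behavior and state spaces.
Propose a mechanism-aware benchmark that stress-tests long-horizon synthesis and ambiguity resolution.
Quantify agent performance across multiple models to expose structural strengths and weaknesses.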

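Proposed Method

Model DRA behavior as a sequence of structure-preserving functors between categorical state spaces (query, web, retrieved subgraph, reasoning).
Define precise category-theoretic notions (pullbacks, limits/colimits) to capture verification and aggregation tasks.
Design a 296-question benchmark organized along four axes: sequential connectivity, V-structure intersection, substructure ordering, and ontological falsification via the Yoneda Probe.
Evaluate 11 frontier models with a human-verified evaluation pipeline, covering the reasoning, search-augmented, and autonomous DRA paradigms.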
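
To make the chain and pullback notions above more concrete, here is a minimal sketch in the category of finite sets. It is an illustration only, not the paper's implementation: the helper names `pullback` and `compose` and all toy data are hypothetical. A pullback of two maps into a shared target collects the pairs that agree on that target, which is roughly what a V-structure intersection check asks an agent to verify; composing maps models a sequential chain of hops.

```python
from __future__ import annotations

from typing import Callable, Hashable, Iterable, TypeVar

A = TypeVar("A", bound=Hashable)
B = TypeVar("B", bound=Hashable)
C = TypeVar("C", bound=Hashable)


def pullback(
    xs: Iterable[A],
    ys: Iterable[B],
    f: Callable[[A], C],
    g: Callable[[B], C],
) -> set[tuple[A, B]]:
    """Pullback of f: X -> Z and g: Y -> Z over finite sets.

    Returns every pair (x, y) with f(x) == g(y), i.e. the candidates
    from both branches that agree on the shared target.
    """
    by_target: dict[C, list[B]] = {}
    for y in ys:
        by_target.setdefault(g(y), []).append(y)
    return {(x, y) for x in xs for y in by_target.get(f(x), [])}


def compose(*maps: Callable) -> Callable:
    """Compose maps left to right, modelling a sequential chain of hops."""
    def chained(value):
        for m in maps:
            value = m(value)
        return value
    return chained


# Hypothetical toy data: two retrieval branches that must agree on a city.
founded_in = {"AcmeLabs": "Zurich", "BetaCorp": "Oslo"}
hosted_in = {"ConfA": "Zurich", "ConfB": "Lima"}
country_of = {"Zurich": "Switzerland", "Oslo": "Norway", "Lima": "Peru"}

# V-structure: organisation -> city <- conference.
pairs = pullback(founded_in, hosted_in, founded_in.get, hosted_in.get)

# Chain: organisation -> city -> country.
org_to_country = compose(founded_in.get, country_of.get)
```

With this toy data, `pairs` contains only `("AcmeLabs", "ConfA")`, since that is the one pair sharing a city, and `org_to_country("AcmeLabs")` follows the two-hop chain to `"Switzerland"`.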

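Experimental Results

Research Questions

RQ1: Can category-theoretic abstractions faithfully model the search and reasoning workflows of DRAs?
RQ2: How well do current models preserve structural relations, such as those captured by functors, in search and reasoning tasks?
RQ3: What are the main failure modes of DRAs under long-horizon synthesis and ambiguity resolution?
RQ4: Are DRAs robust at ontological verification, or do they rely on brittle heuristics across tasks?
RQ5: How does performance vary across the four proposed categorical evaluation axes?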

Main Findings

Benchmark	Sequential Tracing (Chains)	Multi-Source Synthesis (Pullbacks)	Substructure Disentanglement (Re-ordering)	Ontological Probing (Yoneda)	Theory-Based
BrowseComp	✗	✗	✗	✗
WebShaper	✓	✓	✗	✗	✓
DeepResearch Bench	✓	✓	✗	✗	✗
Finance Agent Benchmark	✓	✓	✗	✗	✗
FinSearchComp	✓	✓	✗	✗	✗
Ours	✓	✓	✓	✓	✓

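The state-of-the-art model achieves an average accuracy of only 19.9% on the benchmark.
Advanced DRA pipelines are strong at dynamic topological re-ordering and ontological verification, matching pure reasoning models in falsifying hallucinated premises.
Models almost universally fail at multi-hop structural synthesis and show blind spots under certain mathematical constraints.
Performance varies widely across tasks and models, indicating reliance on heuristics rather than systemic understanding.
The study concludes that generalized mastery of complex structural information by DRAs remains a significant open challenge.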
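
As a rough illustration of how per-axis accuracy and cross-task spread like those summarized above could be computed, the sketch below aggregates graded answers by axis. It rests on assumptions: the record format, the function names, and the sample records are hypothetical placeholders rather than the paper's data, and the summary does not say whether the reported 19.9% is a macro or micro average (a macro average is used here).

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_axis_accuracy(records):
    """Map each axis to the fraction of correct answers.

    `records` is an iterable of (axis, is_correct) pairs.
    """
    buckets = defaultdict(list)
    for axis, is_correct in records:
        buckets[axis].append(1.0 if is_correct else 0.0)
    return {axis: mean(scores) for axis, scores in buckets.items()}


def summarize(records):
    """Return per-axis accuracy, its macro average, and its spread."""
    accuracy = per_axis_accuracy(records)
    values = list(accuracy.values())
    return {
        "per_axis": accuracy,
        "macro_average": mean(values),
        # Population standard deviation across axes: a large value means
        # the model is strong on some axes and weak on others.
        "spread": pstdev(values),
    }


# Placeholder grading records, NOT results from the paper.
sample_records = [
    ("chains", True), ("chains", False),
    ("pullbacks", False), ("pullbacks", False),
    ("reordering", True), ("reordering", True),
    ("yoneda", True), ("yoneda", False),
]

summary = summarize(sample_records)
```

A high `spread` next to a low `macro_average` is the pattern the findings describe: competence concentrated on a few axes instead of uniform structural understanding.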

This review was generated by AI and reviewed by human editors.