QUICK REVIEW

[Paper Review] From Intent to Evidence: A Categorical Approach for Structural Evaluation of Deep Research Agents

Shuoling Liu, Zhiquan Tan|arXiv (Cornell University)|2026. 03. 26.

Machine Learning in Materials Science | Citations: 0

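One-Line Summary

The paper formalizes Deep Research Agents (DRAs) using category theory and introduces a 296-question benchmark that stress-tests their ability to preserve structure along four axes, exposing severe limitations in multi-hop structural synthesis.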

ABSTRACT

Although deep research agents (DRAs) have emerged as a promising paradigm for complex information synthesis, their evaluation remains constrained by ad hoc empirical benchmarks. These heuristic approaches do not rigorously model agent behavior or adequately stress-test long-horizon synthesis and ambiguity resolution. To bridge this gap, we formalize DRA behavior through the lens of category theory, modeling deep research workflow as a composition of structure-preserving maps (functors). Grounded in this theoretical framework, we introduce a novel mechanism-aware benchmark with 296 questions designed to stress-test agents along four interpretable axes: traversing sequential connectivity chains, verifying intersections within V-structure pullbacks, imposing topological ordering on retrieved substructures, and performing ontological falsification via the Yoneda Probe. Our rigorous evaluation of 11 leading models establishes a persistently low baseline, with the state-of-the-art achieving only a 19.9% average accuracy, exposing the difficulty of formal structural stress-testing. Furthermore, our findings reveal a stark dichotomy in the current AI capabilities. While advanced deep research pipelines successfully redefine dynamic topological re-ordering and exhibit robust ontological verification -- matching pure reasoning models in falsifying hallucinated premises -- they almost universally collapse on multi-hop structural synthesis. Crucially, massive performance variance across tasks exposes a lingering reliance on brittle heuristics rather than a systemic understanding. Ultimately, this work demonstrates that while top-tier autonomous agents can now organically unify search and reasoning, achieving a generalized mastery over complex structural information remains a formidable open challenge. Our implementation will be available at https://github.com/tzq1999/CDR.

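Motivation and Objectives

Motivate the need for rigorous, theory-grounded evaluation of DRAs that goes beyond ad hoc benchmarks.
Introduce a category-theoretic formalization of DRA behavior and state spaces.
Propose a mechanism-aware benchmark that stress-tests long-horizon synthesis and ambiguity resolution.
Quantify agent performance across multiple models to reveal structural strengths and weaknesses.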

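Proposed Method

Model DRA behavior as a sequence of structure-preserving functors between categorical state spaces (query, web, retrieved subgraphs, and reasoning).
Define precise category-theoretic notions (pullbacks, limits/colimits) that capture verification and aggregation operations.
Design a 296-question benchmark organized along four axes: sequential connectivity chains, V-structure intersections, substructure ordering, and ontological falsification via the Yoneda Probe.
Evaluate 11 leading models spanning the reasoning, search-augmented, and autonomous DRA paradigms, using a human-verified evaluation pipeline.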
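To make the pullback notion concrete, here is a minimal sketch in the category of finite sets, where the pullback of two maps f: A -> C and g: B -> C is the set of pairs (a, b) with f(a) == g(b). This illustrates the general construction only and is not the paper's implementation. The function name `pullback` and the toy records are hypothetical, chosen to mimic a V-structure question in which two independent evidence chains must agree on a shared target.

```python
from itertools import product
from typing import Callable, Hashable, Iterable, Set, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C", bound=Hashable)


def pullback(
    left: Iterable[A],
    right: Iterable[B],
    f: Callable[[A], C],
    g: Callable[[B], C],
) -> Set[Tuple[A, B]]:
    """Return {(a, b) : f(a) == g(b)}, the pullback of f and g over finite sets."""
    return {(a, b) for a, b in product(left, right) if f(a) == g(b)}


# Hypothetical toy data: each record maps to the venue it is associated with.
paper_venue = {"paper_1": "venue_x", "paper_2": "venue_y"}
talk_venue = {"talk_1": "venue_y", "talk_2": "venue_z"}

# Pairs of (paper, talk) whose two evidence chains land on the same venue.
matches = pullback(
    paper_venue,
    talk_venue,
    paper_venue.__getitem__,
    talk_venue.__getitem__,
)
print(matches)
```

Under this reading, an agent that answers a V-structure question correctly has in effect found an element of the pullback: a pair of retrieved items whose separate chains meet at a common object.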

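Experimental Results

Research Questions

RQ1: Can a category-theoretic abstraction faithfully model the retrieval and reasoning workflow of DRAs?
RQ2: How well do current models preserve structural relations (via functors) during retrieval and reasoning tasks?
RQ3: What are the main failure modes of DRAs in long-horizon synthesis and ambiguity resolution?
RQ4: Do DRAs exhibit robust ontological verification, or do they rely on brittle heuristics across tasks?
RQ5: How does performance vary across the four proposed categorical evaluation axes?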

Key Findings

Benchmark	Sequential Tracing (Chains)	Multi-Source Synthesis (Pullbacks)	Substructure Disentanglement (Re-ordering)	Ontological Probing (Yoneda)	Theory-Based
BrowseComp	✗	✗	✗	✗
WebShaper	✓	✓	✗	✗	✓
DeepResearch Bench	✓	✓	✗	✗	✗
Finance Agent Benchmark	✓	✓	✗	✗	✗
FinSearchComp	✓	✓	✗	✗	✗
Ours	✓	✓	✓	✓	✓

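The state-of-the-art model reaches only 19.9% average accuracy on the benchmark.
Advanced DRA pipelines are strong at dynamic topological re-ordering and ontological verification, matching pure reasoning models at falsifying hallucinated premises.
Models generally fail at multi-hop structural synthesis, and blind spots appear under certain mathematical constraints.
Performance varies widely across tasks and models, indicating reliance on heuristics rather than systematic understanding.
The study emphasizes that achieving generalized mastery of complex structural information with DRAs remains an important open challenge.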
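The headline average and the variance across tasks can be illustrated with a small sketch that computes per-axis accuracy, an unweighted average over axes, and the spread between axes. The review does not say whether the reported 19.9% is averaged over axes or over questions, so the macro-average here is an assumption. The axis labels and pass/fail outcomes are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def axis_accuracy(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of correctly answered questions for each evaluation axis."""
    return {axis: sum(outcomes) / len(outcomes) for axis, outcomes in results.items()}


def macro_average(per_axis: Dict[str, float]) -> float:
    """Unweighted mean over axes (assumed aggregation, not confirmed by the source)."""
    return mean(per_axis.values())


# Hypothetical placeholder outcomes for one model: True means a correct answer.
toy_results = {
    "chains": [True, False, False, False],
    "pullbacks": [False, False, False, False],
    "reordering": [True, True, False, False],
    "yoneda": [True, True, True, False],
}

per_axis = axis_accuracy(toy_results)
average = macro_average(per_axis)
spread = pstdev(per_axis.values())  # population standard deviation across axes

print(per_axis)
print(f"macro-average accuracy: {average:.3f}, spread across axes: {spread:.3f}")
```

A large spread relative to the average is the kind of pattern the findings describe: a model that does well on one axis and collapses on another can still post a low overall score.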


This review was generated by AI and reviewed by a human editor.