QUICK REVIEW

[Paper Review] IDRBench: Interactive Deep Research Benchmark

Feng, Yingchaojie, Qiang Huang|arXiv (Cornell University)|Jan 10, 2026

Topic Modeling | Cited by 0

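One-Sentence Summary

IDRBench is the first benchmark for evaluating interactive deep research with large language models (LLMs). It measures the benefits and costs of user-guided interaction within a modular multi-agent research framework.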

ABSTRACT

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

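Research Motivation and Goals

Support sustained human-AI alignment in deep research tasks, which are often underspecified and evolve over time.
Propose a modular multi-agent research framework with an explicit interaction mechanism for dynamic clarification and guidance.
Provide a scalable, reference-grounded user simulator for large-scale, reproducible evaluation.
Develop an interaction-aware evaluation suite that jointly measures benefits (quality, coverage, alignment) and costs (turns, tokens).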

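Proposed Method

Introduce a four-agent architecture (Planner, Supervisor, Researcher, Reporter) built on LangChain-AI that separates planning, research, and generation.
Embed an interaction mechanism with clarification and user-feedback modules that pauses execution and asks for guidance when the agent is uncertain.
Use a reference-grounded user simulator to provide scalable, goal-oriented feedback based on source documents.
Build an ambiguity-injection process that simulates underspecified queries by compressing detailed prompts.
Evaluate seven representative LLMs (both proprietary and open-weight) in autonomous and interactive settings.
Apply the interaction-aware evaluation suite, which covers semantic alignment, multi-granularity coverage, and intent satisfaction, together with interaction cost (turns and tokens).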
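
The interaction mechanism described above can be pictured as a checkpoint that pauses a stage when the agent is unsure and asks a simulated user for guidance. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `SimulatedUser`, `InteractionLog`, and `maybe_clarify`, the per-stage uncertainty scores, the 0.5 threshold, the word-overlap answer lookup, and the whitespace token count are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


def _words(text):
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split()}


@dataclass
class SimulatedUser:
    """Hypothetical reference-grounded simulator: it answers only from a reference text."""
    reference: str

    def answer(self, question):
        # Pick the reference sentence that shares the most words with the question.
        q = _words(question)
        sentences = [s.strip() for s in self.reference.split(". ") if s.strip()]
        best = max(sentences, key=lambda s: len(q & _words(s)), default="")
        return best if q & _words(best) else "No preference."


@dataclass
class InteractionLog:
    """Tracks interaction cost as turns and crudely counted tokens."""
    turns: int = 0
    tokens: int = 0
    feedback: list = field(default_factory=list)


def maybe_clarify(stage, uncertainty, question, user, log, threshold=0.5):
    """Pause and ask the user only when uncertainty exceeds the (assumed) threshold."""
    if uncertainty <= threshold:
        return None
    reply = user.answer(question)
    log.turns += 1
    log.tokens += len(question.split()) + len(reply.split())  # whitespace tokens
    log.feedback.append((stage, reply))
    return reply


user = SimulatedUser("The report should focus on battery recycling. Use sources after 2020.")
log = InteractionLog()
# Assumed uncertainty scores; a real agent would have to estimate these itself.
for stage, uncertainty, question in [
    ("planning", 0.9, "Which topic should the report focus on?"),
    ("research", 0.2, "Is this source relevant?"),
    ("generation", 0.7, "Which sources should I use?"),
]:
    maybe_clarify(stage, uncertainty, question, user, log)

print(log.turns, log.tokens, log.feedback)
```

In this toy run only the planning and generation stages trigger a question, so the log records two turns. Counting turns and tokens alongside the feedback mirrors the benefit-versus-cost framing of the evaluation suite.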

Figure 1: Comparison of autonomous and interactive deep research. Autonomous agents execute independently and may diverge from user intent, while interactive agents incorporate feedback to maintain alignment.

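Experimental Results

Research Questions

RQ1: Does incorporating interactive feedback improve research quality and user alignment across a range of models?
RQ2: How do interaction benefits trade off against interaction costs (turns and tokens) across model types and stages?
RQ3: How does interaction timing (planning, research loop, generation) affect performance gains?
RQ4: How robust are interaction benefits to different user simulators and to the generation of ambiguous prompts?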

Main Findings

Model	Interaction Mode	Report Similarity	Sentence	Paragraph	Chunk	LLM-ACS	Average Score	Est. API Cost ($/Report)
GPT-5.1	Autonomous	84.92	46.05	69.07	82.30	95.61	75.59	0.473
GPT-5.1	Interactive	87.54	50.44	71.99	88.08	96.79	78.97	0.586
Difference	-	+2.62	+4.39	+2.92	+5.78	+1.18	+3.38	+0.113
Gemini-2.5-Pro	Autonomous	85.00	38.36	76.62	80.92	86.37	73.45	0.393
Gemini-2.5-Pro	Interactive	88.88	46.60	82.15	89.21	92.60	79.89	0.752
Difference	-	+3.88	+8.24	+5.53	+8.29	+6.23	+6.43	+0.359
Claude-Sonnet-4.5	Autonomous	85.96	44.98	69.20	81.52	95.88	75.51	0.987
Claude-Sonnet-4.5	Interactive	89.15	52.92	74.20	88.06	98.00	80.47	2.220
Difference	-	+3.19	+7.94	+5.00	+6.54	+2.12	+4.96	+1.233
Grok-4.1-Fast	Autonomous	81.28	30.76	65.33	72.93	87.44	67.55	0.192
Grok-4.1-Fast	Interactive	86.68	38.63	76.47	83.24	92.56	75.52	0.275
Difference	-	+5.40	+7.87	+11.14	+10.31	+5.12	+7.97	+0.083
Llama-4-Maverick	Autonomous	76.06	18.44	64.72	61.78	53.06	54.81	0.021
Llama-4-Maverick	Interactive	83.93	24.65	78.46	75.31	66.53	65.78	0.026
Difference	-	+7.87	+6.21	+13.74	+13.53	+13.47	+10.96	+0.005
Qwen3-235B	Autonomous	79.76	28.19	61.03	69.00	81.84	63.96	0.139
Qwen3-235B	Interactive	82.83	32.81	65.14	75.89	91.70	69.67	0.133
Difference	-	+3.07	+4.62	+4.11	+6.89	+9.86	+5.71	-0.006
DeepSeek-V3.2	Autonomous	84.32	37.94	73.65	80.73	90.09	73.35	0.146
DeepSeek-V3.2	Interactive	88.11	44.93	79.47	87.13	93.54	78.64	0.185
Difference	-	+3.79	+6.99	+5.82	+6.40	+3.45	+5.29	+0.039
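
The arithmetic behind the table can be checked directly from its Autonomous and Interactive rows. The sketch below copies two models' values from the table and recomputes the per-metric differences and the Average Score. Treating the Average Score as the plain mean of the five metric columns is an assumption inferred from the numbers, not something the text states, and the helper names `average` and `difference` are illustrative.

```python
# Values copied from the GPT-5.1 and Gemini-2.5-Pro rows of the table above.
METRICS = ["Report Similarity", "Sentence", "Paragraph", "Chunk", "LLM-ACS"]

rows = {
    ("GPT-5.1", "Autonomous"): [84.92, 46.05, 69.07, 82.30, 95.61],
    ("GPT-5.1", "Interactive"): [87.54, 50.44, 71.99, 88.08, 96.79],
    ("Gemini-2.5-Pro", "Autonomous"): [85.00, 38.36, 76.62, 80.92, 86.37],
    ("Gemini-2.5-Pro", "Interactive"): [88.88, 46.60, 82.15, 89.21, 92.60],
}


def average(scores):
    """Assumed definition: unweighted mean of the five metrics, rounded to 2 places."""
    return round(sum(scores) / len(scores), 2)


def difference(model):
    """Interactive minus Autonomous, metric by metric."""
    auto = rows[(model, "Autonomous")]
    inter = rows[(model, "Interactive")]
    return [round(i - a, 2) for a, i in zip(auto, inter)]


for model in ["GPT-5.1", "Gemini-2.5-Pro"]:
    deltas = dict(zip(METRICS, difference(model)))
    print(model, deltas)
    print("  averages:", average(rows[(model, "Autonomous")]),
          "->", average(rows[(model, "Interactive")]))
```

For these two models, the recomputed averages match the Average Score column, which supports reading that column as an unweighted mean.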

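Interaction consistently improves report quality and alignment across all evaluated models.
For some models, the gain from interaction matches or exceeds the gain from moving to a higher-capacity model.
Lower-capacity models generally benefit more from interaction than higher-capacity models (diminishing returns for large models).
Early-stage interaction (planning) yields larger gains than later intervention, and full-lifecycle interaction gives the best overall performance.
Interaction reduces extreme failures and makes models more robust.
Open-weight models such as DeepSeek-V3.2 can outperform higher-capacity models when they use interaction effectively.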
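
Several of these findings can be sanity-checked against the Average Score column of the results table. The sketch below copies the autonomous and interactive averages for all seven models and asks three questions: is every gain positive, does the model with the lowest autonomous baseline gain the most, and which models in interactive mode beat GPT-5.1 in autonomous mode. The helper name `beats_autonomous` is illustrative, and gains recomputed from the rounded averages can differ by 0.01 from the table's own Difference rows.

```python
# (autonomous average, interactive average), copied from the Average Score column.
results = {
    "GPT-5.1": (75.59, 78.97),
    "Gemini-2.5-Pro": (73.45, 79.89),
    "Claude-Sonnet-4.5": (75.51, 80.47),
    "Grok-4.1-Fast": (67.55, 75.52),
    "Llama-4-Maverick": (54.81, 65.78),
    "Qwen3-235B": (63.96, 69.67),
    "DeepSeek-V3.2": (73.35, 78.64),
}

# Gain from interaction, recomputed from the rounded averages.
gains = {m: round(inter - auto, 2) for m, (auto, inter) in results.items()}

all_improve = all(g > 0 for g in gains.values())
largest_gain = max(gains, key=gains.get)
lowest_baseline = min(results, key=lambda m: results[m][0])


def beats_autonomous(target):
    """Models whose interactive average exceeds the target's autonomous average."""
    bar = results[target][0]
    return sorted(m for m, (_, inter) in results.items() if m != target and inter > bar)


print("every model improves:", all_improve)
print("largest gain:", largest_gain, "| lowest baseline:", lowest_baseline)
print("interactive > GPT-5.1 autonomous:", beats_autonomous("GPT-5.1"))
```

This check uses only the averaged scores. It says nothing about the stage-wise or robustness findings, which depend on results not reproduced in the table.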

Figure 2: Overview of IDRBench. The benchmark integrates an interactive deep research framework with curated data construction, representative LLMs, and interaction-aware evaluation. It features a multi-agent pipeline for Planning, Research Loop, and Generation, augmented with an interaction mec


This review was generated by AI and reviewed by a human editor.