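[Paper Digest] IDRBench: Interactive Deep Research Benchmark
IDRBench is the first benchmark for evaluating interactive deep research with large language models (LLMs). It measures the benefits and costs of user-guided interaction within a modular multi-agent research framework.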
Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.
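Motivation and Goals
- Promote sustained human-AI alignment in deep research tasks, which are underspecified and evolve over time.
- Propose a modular multi-agent research framework with an explicit interaction mechanism for dynamic clarification and guidance.
- Provide a scalable, reference-grounded user simulator that enables large-scale, reproducible evaluation.
- Develop an interaction-aware evaluation suite that jointly measures benefits (quality, coverage, alignment) and costs (turns, tokens).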
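Proposed Method
- Introduce a four-agent architecture (Planner, Supervisor, Researcher, Reporter) built on LangChain-AI that separates planning, research, and generation.
- Embed an interaction mechanism with clarification and user-feedback modules that pauses execution and asks for guidance when the system is uncertain.
- Use a reference-grounded user simulator to provide scalable, goal-oriented feedback based on source documents.
- Construct an ambiguity-injection process that simulates underspecified queries by compressing detailed prompts.
- Evaluate seven representative LLMs (both proprietary and open-weight) in autonomous and interactive settings.
- Apply the interaction-aware evaluation suite, which covers semantic alignment, multi-granularity coverage, and intent satisfaction, together with interaction cost (turns and tokens).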
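
The interaction mechanism and its cost accounting can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-stage `uncertainty` scores, the `threshold`, the whitespace-based token count, the fixed clarification question, and every function name here are hypothetical. The real framework uses LLM agents to decide when and what to ask.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class InteractionLog:
    """Tracks the two interaction costs the benchmark reports: turns and tokens."""
    turns: int = 0
    tokens: int = 0
    transcript: List[str] = field(default_factory=list)


def count_tokens(text: str) -> int:
    # Crude whitespace proxy (assumption); a real system would use a tokenizer.
    return len(text.split())


def make_simulator(reference_focus: str) -> Callable[[str], str]:
    """Hypothetical reference-grounded user: answers from a known reference goal."""
    def answer(question: str) -> str:
        return f"Focus on {reference_focus}"
    return answer


def maybe_ask(stage: str, uncertainty: float, threshold: float,
              ask_user: Callable[[str], str], log: InteractionLog) -> Optional[str]:
    """Pause and ask for guidance only when uncertainty exceeds the threshold."""
    if uncertainty <= threshold:
        return None
    question = f"[{stage}] Which aspect should the report prioritize?"
    reply = ask_user(question)
    log.turns += 1
    log.tokens += count_tokens(question) + count_tokens(reply)
    log.transcript.append(f"{question} -> {reply}")
    return reply


def run_pipeline(query: str, uncertainty_by_stage: Dict[str, float],
                 ask_user: Callable[[str], str],
                 threshold: float = 0.5) -> Tuple[List[str], InteractionLog]:
    """Walk the four agents in order, collecting any user feedback as context."""
    log = InteractionLog()
    context = [query]
    for stage in ("Planner", "Supervisor", "Researcher", "Reporter"):
        feedback = maybe_ask(stage, uncertainty_by_stage.get(stage, 0.0),
                             threshold, ask_user, log)
        if feedback is not None:
            context.append(feedback)
    return context, log


# Usage: only the Planner is uncertain enough to trigger a question.
user = make_simulator("cost trade-offs")
context, log = run_pipeline("Survey interactive research agents",
                            {"Planner": 0.9, "Researcher": 0.2}, user)
print(log.turns, log.tokens)
```

Raising `threshold` lowers turns and tokens at the risk of proceeding on a misread intent, which is the benefit-versus-cost trade-off the evaluation suite is designed to expose.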

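Experimental Results
Research Questions
- RQ1: Does incorporating interactive feedback improve research quality and user alignment across a range of models?
- RQ2: How do interaction benefits trade off against interaction costs (turns and tokens) across model types and stages?
- RQ3: How does the timing of interaction (planning, research loop, generation) affect performance gains?
- RQ4: How robust are the interaction benefits to different user simulators and to how ambiguous prompts are generated?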
Key Findings
| Model | Interaction Mode | Report Similarity | Sentence | Paragraph | Chunk | LLM-ACS | Average Score | Est. API Cost ($/Report) |
|---|---|---|---|---|---|---|---|---|
| GPT-5.1 | Autonomous | 84.92 | 46.05 | 69.07 | 82.30 | 95.61 | 75.59 | 0.473 |
| GPT-5.1 | Interactive | 87.54 | 50.44 | 71.99 | 88.08 | 96.79 | 78.97 | 0.586 |
| Difference | - | +2.62 | +4.39 | +2.92 | +5.78 | +1.18 | +3.38 | +0.113 |
| Gemini-2.5-Pro | Autonomous | 85.00 | 38.36 | 76.62 | 80.92 | 86.37 | 73.45 | 0.393 |
| Gemini-2.5-Pro | Interactive | 88.88 | 46.60 | 82.15 | 89.21 | 92.60 | 79.89 | 0.752 |
| Difference | - | +3.88 | +8.24 | +5.53 | +8.29 | +6.23 | +6.43 | +0.359 |
| Claude-Sonnet-4.5 | Autonomous | 85.96 | 44.98 | 69.20 | 81.52 | 95.88 | 75.51 | 0.987 |
| Claude-Sonnet-4.5 | Interactive | 89.15 | 52.92 | 74.20 | 88.06 | 98.00 | 80.47 | 2.220 |
| Difference | - | +3.19 | +7.94 | +5.00 | +6.54 | +2.12 | +4.96 | +1.233 |
| Grok-4.1-Fast | Autonomous | 81.28 | 30.76 | 65.33 | 72.93 | 87.44 | 67.55 | 0.192 |
| Grok-4.1-Fast | Interactive | 86.68 | 38.63 | 76.47 | 83.24 | 92.56 | 75.52 | 0.275 |
| Difference | - | +5.40 | +7.87 | +11.14 | +10.31 | +5.12 | +7.97 | +0.083 |
| Llama-4-Maverick | Autonomous | 76.06 | 18.44 | 64.72 | 61.78 | 53.06 | 54.81 | 0.021 |
| Llama-4-Maverick | Interactive | 83.93 | 24.65 | 78.46 | 75.31 | 66.53 | 65.78 | 0.026 |
| Difference | - | +7.87 | +6.21 | +13.74 | +13.53 | +13.47 | +10.96 | +0.005 |
| Qwen3-235B | Autonomous | 79.76 | 28.19 | 61.03 | 69.00 | 81.84 | 63.96 | 0.139 |
| Qwen3-235B | Interactive | 82.83 | 32.81 | 65.14 | 75.89 | 91.70 | 69.67 | 0.133 |
| Difference | - | +3.07 | +4.62 | +4.11 | +6.89 | +9.86 | +5.71 | -0.006 |
| DeepSeek-V3.2 | Autonomous | 84.32 | 37.94 | 73.65 | 80.73 | 90.09 | 73.35 | 0.146 |
| DeepSeek-V3.2 | Interactive | 88.11 | 44.93 | 79.47 | 87.13 | 93.54 | 78.64 | 0.185 |
| Difference | - | +3.79 | +6.99 | +5.82 | +6.40 | +3.45 | +5.29 | +0.039 |
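
The Average Score column appears to be the unweighted mean of the five metric columns (Report Similarity, Sentence, Paragraph, Chunk, LLM-ACS). The source does not state this explicitly, so treat it as an assumption that the check below tests against the Autonomous and Interactive rows copied from the table. The helper names `mean_score` and `difference` are illustrative only.

```python
# Each entry: (autonomous metrics, interactive metrics,
#              reported autonomous average, reported interactive average).
# Metric order: Report Similarity, Sentence, Paragraph, Chunk, LLM-ACS.
TABLE = {
    "GPT-5.1": ([84.92, 46.05, 69.07, 82.30, 95.61],
                [87.54, 50.44, 71.99, 88.08, 96.79], 75.59, 78.97),
    "Gemini-2.5-Pro": ([85.00, 38.36, 76.62, 80.92, 86.37],
                       [88.88, 46.60, 82.15, 89.21, 92.60], 73.45, 79.89),
    "Claude-Sonnet-4.5": ([85.96, 44.98, 69.20, 81.52, 95.88],
                          [89.15, 52.92, 74.20, 88.06, 98.00], 75.51, 80.47),
    "Grok-4.1-Fast": ([81.28, 30.76, 65.33, 72.93, 87.44],
                      [86.68, 38.63, 76.47, 83.24, 92.56], 67.55, 75.52),
    "Llama-4-Maverick": ([76.06, 18.44, 64.72, 61.78, 53.06],
                         [83.93, 24.65, 78.46, 75.31, 66.53], 54.81, 65.78),
    "Qwen3-235B": ([79.76, 28.19, 61.03, 69.00, 81.84],
                   [82.83, 32.81, 65.14, 75.89, 91.70], 63.96, 69.67),
    "DeepSeek-V3.2": ([84.32, 37.94, 73.65, 80.73, 90.09],
                      [88.11, 44.93, 79.47, 87.13, 93.54], 73.35, 78.64),
}


def mean_score(metrics):
    """Unweighted mean of the metric columns, rounded like the table."""
    return round(sum(metrics) / len(metrics), 2)


def difference(autonomous, interactive):
    """Per-metric gain from interaction (Interactive minus Autonomous)."""
    return [round(i - a, 2) for a, i in zip(autonomous, interactive)]


for model, (auto, inter, auto_avg, inter_avg) in TABLE.items():
    # Allow a small tolerance for rounding in the published figures.
    consistent = (abs(mean_score(auto) - auto_avg) < 0.006
                  and abs(mean_score(inter) - inter_avg) < 0.006)
    print(model, mean_score(auto), mean_score(inter), consistent)
    print("  per-metric gain:", difference(auto, inter))
```

Under this reading, every metric carries equal weight, so a large gain in one coverage granularity moves the average as much as the same gain in Report Similarity.
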
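- Interaction consistently improves report quality and alignment across all evaluated models.
- For some models, the gains from interaction match or exceed the gains from increasing model capacity.
- Lower-capacity models generally benefit more from interaction than higher-capacity ones, with diminishing returns for large models (for example, Llama-4-Maverick gains +10.96 in Average Score versus +3.38 for GPT-5.1).
- Early-stage interaction (planning) yields larger gains than later interventions, and full-lifecycle interaction gives the best overall performance.
- Interaction reduces extreme failures and makes models more robust.
- Open-weight models such as DeepSeek-V3.2 can outperform higher-capacity models when they use interaction effectively.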
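
The benefit-versus-cost trade-off behind these findings can be worked through directly from the table. The sketch below uses only the Average Score and estimated API cost columns. The `gain` and `extra_cost` names, and the choice to rank by average gain, are illustrative assumptions rather than metrics defined by the paper.

```python
# model: (autonomous avg, interactive avg, autonomous $/report, interactive $/report)
RESULTS = {
    "GPT-5.1": (75.59, 78.97, 0.473, 0.586),
    "Gemini-2.5-Pro": (73.45, 79.89, 0.393, 0.752),
    "Claude-Sonnet-4.5": (75.51, 80.47, 0.987, 2.220),
    "Grok-4.1-Fast": (67.55, 75.52, 0.192, 0.275),
    "Llama-4-Maverick": (54.81, 65.78, 0.021, 0.026),
    "Qwen3-235B": (63.96, 69.67, 0.139, 0.133),
    "DeepSeek-V3.2": (73.35, 78.64, 0.146, 0.185),
}


def summarize(results):
    """Return (model, gain, extra_cost) tuples sorted by gain, largest first."""
    rows = []
    for model, (auto_avg, inter_avg, auto_cost, inter_cost) in results.items():
        gain = round(inter_avg - auto_avg, 2)           # quality benefit
        extra_cost = round(inter_cost - auto_cost, 3)   # added $ per report
        rows.append((model, gain, extra_cost))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = summarize(RESULTS)
for model, gain, extra_cost in ranking:
    print(f"{model:18s} gain={gain:+.2f}  extra_cost={extra_cost:+.3f}")

# Interactive DeepSeek-V3.2 compared with autonomous GPT-5.1.
print(RESULTS["DeepSeek-V3.2"][1] > RESULTS["GPT-5.1"][0])
```

The ranking places the models with the lowest autonomous scores at the top, and the added cost varies widely: Claude-Sonnet-4.5 pays over a dollar more per report, while Qwen3-235B's estimated cost is slightly lower in the interactive setting.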

This digest was generated by AI and reviewed by a human editor.