QUICK REVIEW

[Paper Review] IDRBench: Interactive Deep Research Benchmark

Feng, Yingchaojie, Qiang Huang|arXiv (Cornell University)|Jan 10, 2026

Topic Modeling | Citations: 0

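One-Line Summary

IDRBench is the first benchmark for evaluating interactive deep research with LLMs; it measures the benefits and costs of user-driven interaction across a modular multi-agent research framework.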

ABSTRACT

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

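Motivation and Objectives

Sustain human-AI alignment on research tasks that are underspecified and evolve over time.
Propose a modular multi-agent research framework with explicit interaction mechanisms that allow dynamic clarification and guidance.
Provide a reference-grounded user simulator that enables large-scale, reproducible evaluation.
Develop an interaction-aware evaluation suite that jointly assesses benefits (quality, coverage, and alignment) and costs (turns and tokens).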

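Proposed Method

Introduce a four-agent architecture (Planner, Supervisor, Researcher, Reporter) built on LangChain-AI that decomposes the work into planning, research, and generation.
Add an interaction mechanism through Clarification and User Feedback modules, which pause execution and ask for guidance when the agent is uncertain.
Use a reference-grounded User Simulator to provide scalable, goal-oriented feedback rooted in the source documents.
Build an Ambiguity Injection process that compresses detailed prompts to simulate underspecified queries.
Evaluate seven representative LLMs, both proprietary and open-weight, in autonomous and interactive settings.
Apply an interaction-aware evaluation suite that combines semantic alignment, multi-granularity coverage, and intent satisfaction metrics with interaction cost (turns and tokens).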
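
The pause-and-ask mechanism and the cost accounting described above can be illustrated with a minimal sketch. This is not the IDRBench implementation: every class, function, and keyword here (`ReferenceSimulator`, `run_stage`, the whitespace token count, the example questions) is a hypothetical stand-in, and the simulator is reduced to a keyword lookup over a toy reference.

```python
from dataclasses import dataclass, field

# All names below are hypothetical; they are not the IDRBench API.


@dataclass
class InteractionLog:
    """Accumulates interaction cost as turns and (approximate) tokens."""
    turns: int = 0
    tokens: int = 0
    transcript: list = field(default_factory=list)


class ReferenceSimulator:
    """Toy user simulator that answers only from a reference document."""

    def __init__(self, reference_facts):
        # Maps a topic keyword to the detail the simulated user has in mind.
        self.reference_facts = reference_facts

    def answer(self, question):
        for keyword, detail in self.reference_facts.items():
            if keyword in question.lower():
                return detail
        return "No preference."


def count_tokens(text):
    # Crude whitespace proxy; a real system would use the model tokenizer.
    return len(text.split())


def run_stage(stage, open_questions, simulator, log):
    """Pause on each open question for this stage and ask the simulator."""
    guidance = []
    for question in open_questions.get(stage, []):
        reply = simulator.answer(question)
        log.turns += 1
        log.tokens += count_tokens(question) + count_tokens(reply)
        log.transcript.append((stage, question, reply))
        guidance.append(reply)
    return guidance


def run_pipeline(open_questions, simulator):
    """Walk the three stages in order, collecting guidance and cost."""
    log = InteractionLog()
    guidance = {}
    for stage in ("planning", "research", "generation"):
        guidance[stage] = run_stage(stage, open_questions, simulator, log)
    return guidance, log


simulator = ReferenceSimulator({
    "scope": "Focus on open-weight models.",
    "format": "A two-page summary.",
})
questions = {
    "planning": ["What scope should the report cover?"],
    "generation": ["Which format do you prefer?"],
}
guidance, log = run_pipeline(questions, simulator)
print(log.turns, log.tokens)  # prints "2 18"
```

Keeping the cost counters in one log object makes it easy to report benefits and costs side by side, which is the point of an interaction-aware evaluation.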

Figure 1: Comparison of autonomous and interactive deep research. Autonomous agents execute independently and may diverge from user intent, while interactive agents incorporate feedback to maintain alignment.

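Experimental Results

Research Questions

RQ1: Does incorporating interactive feedback improve research quality and alignment with the user across different LLMs?
RQ2: How do the benefits of interaction trade off against interaction cost (turns and tokens) across model types and stages?
RQ3: How does the timing of interaction (planning, research loop, generation) affect performance gains?
RQ4: How robust are the benefits of interaction to different user simulators and to the way ambiguous prompts are generated?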

Key Findings

Model	Interaction Mode	Report Similarity	Sentence	Paragraph	Chunk	LLM-ACS	Average Score	Est. API Cost ($/Report)
GPT-5.1	Autonomous	84.92	46.05	69.07	82.30	95.61	75.59	0.473
GPT-5.1	Interactive	87.54	50.44	71.99	88.08	96.79	78.97	0.586
Difference	-	+2.62	+4.39	+2.92	+5.78	+1.18	+3.38	+0.113
Gemini-2.5-Pro	Autonomous	85.00	38.36	76.62	80.92	86.37	73.45	0.393
Gemini-2.5-Pro	Interactive	88.88	46.60	82.15	89.21	92.60	79.89	0.752
Difference	-	+3.88	+8.24	+5.53	+8.29	+6.23	+6.43	+0.359
Claude-Sonnet-4.5	Autonomous	85.96	44.98	69.20	81.52	95.88	75.51	0.987
Claude-Sonnet-4.5	Interactive	89.15	52.92	74.20	88.06	98.00	80.47	2.220
Difference	-	+3.19	+7.94	+5.00	+6.54	+2.12	+4.96	+1.233
Grok-4.1-Fast	Autonomous	81.28	30.76	65.33	72.93	87.44	67.55	0.192
Grok-4.1-Fast	Interactive	86.68	38.63	76.47	83.24	92.56	75.52	0.275
Difference	-	+5.40	+7.87	+11.14	+10.31	+5.12	+7.97	+0.083
Llama-4-Maverick	Autonomous	76.06	18.44	64.72	61.78	53.06	54.81	0.021
Llama-4-Maverick	Interactive	83.93	24.65	78.46	75.31	66.53	65.78	0.026
Difference	-	+7.87	+6.21	+13.74	+13.53	+13.47	+10.96	+0.005
Qwen3-235B	Autonomous	79.76	28.19	61.03	69.00	81.84	63.96	0.139
Qwen3-235B	Interactive	82.83	32.81	65.14	75.89	91.70	69.67	0.133
Difference	-	+3.07	+4.62	+4.11	+6.89	+9.86	+5.71	-0.006
DeepSeek-V3.2	Autonomous	84.32	37.94	73.65	80.73	90.09	73.35	0.146
DeepSeek-V3.2	Interactive	88.11	44.93	79.47	87.13	93.54	78.64	0.185
Difference	-	+3.79	+6.99	+5.82	+6.40	+3.45	+5.29	+0.039
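
As a reading aid for the table, the sketch below recomputes the Average Score and the Difference row for GPT-5.1 from the per-metric scores. It assumes that Average Score is the unweighted mean of the five quality columns; that assumption matches the GPT-5.1 rows but is inferred from the numbers, not stated in this review. Variable and function names are illustrative.

```python
# Scores copied from the GPT-5.1 rows of the table.
METRICS = ["Report Similarity", "Sentence", "Paragraph", "Chunk", "LLM-ACS"]
autonomous = [84.92, 46.05, 69.07, 82.30, 95.61]
interactive = [87.54, 50.44, 71.99, 88.08, 96.79]


def average(scores):
    # Assumption: Average Score is the plain mean of the five columns.
    return round(sum(scores) / len(scores), 2)


# Per-metric gain from interaction (Interactive minus Autonomous).
differences = {
    name: round(inter - auto, 2)
    for name, auto, inter in zip(METRICS, autonomous, interactive)
}

# Extra estimated API cost per report, in dollars.
cost_delta = round(0.586 - 0.473, 3)

print(average(autonomous))   # prints 75.59
print(average(interactive))  # prints 78.97
print(differences)
print(cost_delta)            # prints 0.113
```

For GPT-5.1 this reproduces the reported averages of 75.59 and 78.97, a gain of 3.38 points for about $0.113 more per report.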

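Interaction consistently improves report quality and alignment across all evaluated models.
For some models, the gain from interaction matches or exceeds the gain from moving to a higher-capacity model.
Lower-capacity models benefit more from interaction than higher-capacity models, and large models show diminishing returns.
Early interaction (at the planning stage) yields larger gains than later interventions, and interaction across the full lifecycle gives the best overall performance.
Interaction reduces extreme failures and improves robustness across models.
Open-weight models such as DeepSeek-V3.2 can outperform higher-capacity models when interaction is used effectively.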
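
The findings about interaction versus model capacity can be checked against the Average Score column with a short script. The sketch below uses only the averages listed in the table; the variable names are illustrative, and gains recomputed from the rounded averages may differ from the table's Difference rows by 0.01.

```python
# (autonomous average, interactive average) from the Average Score column.
averages = {
    "GPT-5.1": (75.59, 78.97),
    "Gemini-2.5-Pro": (73.45, 79.89),
    "Claude-Sonnet-4.5": (75.51, 80.47),
    "Grok-4.1-Fast": (67.55, 75.52),
    "Llama-4-Maverick": (54.81, 65.78),
    "Qwen3-235B": (63.96, 69.67),
    "DeepSeek-V3.2": (73.35, 78.64),
}

# Gain from interaction for each model.
gains = {model: round(inter - auto, 2) for model, (auto, inter) in averages.items()}

# Strongest model when every model runs autonomously.
best_autonomous = max(averages, key=lambda m: averages[m][0])
best_autonomous_score = averages[best_autonomous][0]

# Models whose interactive score beats the best autonomous score.
beats_best_autonomous = [
    model for model, (_, inter) in averages.items()
    if inter > best_autonomous_score
]

print(best_autonomous)                 # prints GPT-5.1
print(beats_best_autonomous)
print(max(gains, key=gains.get))       # prints Llama-4-Maverick
print(min(gains, key=gains.get))       # prints GPT-5.1
```

On these numbers, interactive DeepSeek-V3.2 (78.64) exceeds autonomous GPT-5.1 (75.59), the largest gain goes to Llama-4-Maverick, and the smallest goes to GPT-5.1, which is consistent with the diminishing-returns observation.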

Figure 2: Overview of IDRBench. The benchmark integrates an interactive deep research framework with curated data construction, representative LLMs, and interaction-aware evaluation. It features a multi-agent pipeline for Planning, Research Loop, and Generation, augmented with an interaction mec

This review was generated by AI and checked by a human editor.