QUICK REVIEW

[Paper Review] IDRBench: Interactive Deep Research Benchmark

Yingchaojie Feng, Qiang Huang | arXiv (Cornell University) | 2026-01-10

Topic Modeling | Citations: 0

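One-Line Summary

IDRBench is the first benchmark for evaluating interactive deep research with LLMs; it measures both the benefits and the costs of user-driven interaction within a modular multi-agent research framework.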

ABSTRACT

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, most existing systems operate in an autonomous manner, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, making sustained interaction essential for robust alignment. Despite its importance, interaction remains largely invisible to existing deep research benchmarks, which neither model dynamic user feedback nor quantify its costs. We introduce IDRBench, the first benchmark for systematically evaluating interactive deep research. IDRBench combines a modular multi-agent research framework with on-demand interaction, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures interaction benefits (quality and alignment) and costs (turns and tokens). Experiments across seven state-of-the-art LLMs show that interaction consistently improves research quality and robustness, often outweighing differences in model capacity, while revealing substantial trade-offs in interaction efficiency.

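Motivation and Goals

Address deep research tasks whose goals are underspecified and evolve over time by promoting sustained human-AI alignment.
Propose a modular multi-agent research framework with an explicit interaction mechanism that enables dynamic clarification and steering.
Provide a reference-grounded user simulator that enables large-scale, reproducible evaluation.
Develop an interaction-aware evaluation suite that jointly assesses benefits (quality, alignment, coverage) and costs (turns, tokens).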

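Proposed Method

Introduce a four-agent architecture (Planner, Supervisor, Researcher, Reporter), built on LangChain-AI, that decomposes the workflow into planning, research, and generation.
Introduce an interaction mechanism with Clarification and User Feedback modules that pauses execution and asks for guidance when the agent is uncertain.
Use a reference-grounded user simulator to provide scalable, goal-directed feedback grounded in the source material.
Build an Ambiguity Injection process that compresses detailed prompts to simulate ambiguous queries.
Evaluate seven representative LLMs (proprietary and open-weight) in both autonomous and interactive settings.
Apply an interaction-aware evaluation suite with metrics for semantic alignment, multi-granularity coverage, and intent satisfaction, together with interaction cost (turns and tokens).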
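
The pause-and-ask mechanism and the reference-grounded simulator can be pictured with a toy sketch. This is not the paper's implementation: `simulated_user`, `run_stage`, `InteractionLog`, the keyword lookup, and the whitespace-based token count are all hypothetical stand-ins chosen so the example runs without any LLM call. It only illustrates the idea that each clarification adds to the turn and token cost, while replies stay tied to a reference.

```python
from dataclasses import dataclass


@dataclass
class InteractionLog:
    """Accumulates interaction cost (hypothetical structure)."""
    turns: int = 0
    tokens: int = 0


def simulated_user(question, reference_facts):
    """Toy reference-grounded simulator: answers only from the reference."""
    for topic, answer in reference_facts.items():
        if topic in question.lower():
            return answer
    return "No preference."


def run_stage(open_questions, reference_facts, log, interactive=True):
    """Ask one question per uncertainty when interactive; otherwise ask nothing."""
    guidance = []
    if not interactive:
        return guidance
    for question in open_questions:
        reply = simulated_user(question, reference_facts)
        log.turns += 1
        # Whitespace word count as a crude proxy for tokens (assumption).
        log.tokens += len(question.split()) + len(reply.split())
        guidance.append(reply)
    return guidance


reference_facts = {
    "scope": "Focus on 2020-2024 studies.",
    "audience": "Write for practitioners.",
}
questions = ["Which scope should the report cover?", "What about formatting?"]

log = InteractionLog()
guidance = run_stage(questions, reference_facts, log)
print(guidance, log.turns, log.tokens)

autonomous_log = InteractionLog()
autonomous_guidance = run_stage(questions, reference_facts, autonomous_log, interactive=False)
```

In the autonomous setting the same call returns no guidance and records zero cost, which mirrors the benefit-versus-cost comparison the evaluation suite is designed to make.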

Figure 1: Comparison of autonomous and interactive deep research. Autonomous agents execute independently and may diverge from user intent, while interactive agents incorporate feedback to maintain alignment.

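Experimental Results

Research Questions

RQ1: Does introducing interactive feedback improve research quality and user alignment across different LLMs?
RQ2: How do interaction benefits trade off against interaction costs (turns and tokens) across model types and stages?
RQ3: How does the timing of interaction (planning, research loop, generation) affect performance gains?
RQ4: How robust are the interaction benefits to different user simulators and to the way ambiguous prompts are generated?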

Key Results

Model	Interaction Mode	Report Similarity	Sentence	Paragraph	Chunk	LLM-ACS	Average Score	Estimated API Cost ($/report)
GPT-5.1	Autonomous	84.92	46.05	69.07	82.30	95.61	75.59	0.473
GPT-5.1	Interactive	87.54	50.44	71.99	88.08	96.79	78.97	0.586
Difference	-	+2.62	+4.39	+2.92	+5.78	+1.18	+3.38	+0.113
Gemini-2.5-Pro	Autonomous	85.00	38.36	76.62	80.92	86.37	73.45	0.393
Gemini-2.5-Pro	Interactive	88.88	46.60	82.15	89.21	92.60	79.89	0.752
Difference	-	+3.88	+8.24	+5.53	+8.29	+6.23	+6.43	+0.359
Claude-Sonnet-4.5	Autonomous	85.96	44.98	69.20	81.52	95.88	75.51	0.987
Claude-Sonnet-4.5	Interactive	89.15	52.92	74.20	88.06	98.00	80.47	2.220
Difference	-	+3.19	+7.94	+5.00	+6.54	+2.12	+4.96	+1.233
Grok-4.1-Fast	Autonomous	81.28	30.76	65.33	72.93	87.44	67.55	0.192
Grok-4.1-Fast	Interactive	86.68	38.63	76.47	83.24	92.56	75.52	0.275
Difference	-	+5.40	+7.87	+11.14	+10.31	+5.12	+7.97	+0.083
Llama-4-Maverick	Autonomous	76.06	18.44	64.72	61.78	53.06	54.81	0.021
Llama-4-Maverick	Interactive	83.93	24.65	78.46	75.31	66.53	65.78	0.026
Difference	-	+7.87	+6.21	+13.74	+13.53	+13.47	+10.96	+0.005
Qwen3-235B	Autonomous	79.76	28.19	61.03	69.00	81.84	63.96	0.139
Qwen3-235B	Interactive	82.83	32.81	65.14	75.89	91.70	69.67	0.133
Difference	-	+3.07	+4.62	+4.11	+6.89	+9.86	+5.71	-0.006
DeepSeek-V3.2	Autonomous	84.32	37.94	73.65	80.73	90.09	73.35	0.146
DeepSeek-V3.2	Interactive	88.11	44.93	79.47	87.13	93.54	78.64	0.185
Difference	-	+3.79	+6.99	+5.82	+6.40	+3.45	+5.29	+0.039
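
The arithmetic behind the table can be checked with a short sketch. It assumes the average score is the unweighted mean of the five metric columns (report similarity, sentence, paragraph, chunk, LLM-ACS); the review does not state this, but the GPT-5.1 and Gemini-2.5-Pro rows are consistent with it. The function names are hypothetical, and the numbers are the GPT-5.1 rows from the table.

```python
# GPT-5.1 rows: report similarity, sentence, paragraph, chunk, LLM-ACS.
gpt51_autonomous = [84.92, 46.05, 69.07, 82.30, 95.61]
gpt51_interactive = [87.54, 50.44, 71.99, 88.08, 96.79]


def average(scores):
    """Unweighted mean of the five metrics (assumed definition)."""
    return round(sum(scores) / len(scores), 2)


def difference(autonomous, interactive):
    """Per-metric gain of the interactive run over the autonomous run."""
    return [round(b - a, 2) for a, b in zip(autonomous, interactive)]


avg_autonomous = average(gpt51_autonomous)
avg_interactive = average(gpt51_interactive)
gains = difference(gpt51_autonomous, gpt51_interactive)

print(avg_autonomous, avg_interactive, gains)
```

Because the published averages are presumably rounded from unrounded scores, recomputing a difference from the rounded rows can be off by 0.01 for some models.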

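Interaction consistently improves report quality and alignment for every model evaluated.
For some models, the gain from interaction approaches or exceeds the gain from greater model capacity.
Lower-capacity models generally benefit more from interaction, while larger models show diminishing returns.
Early-stage interaction (planning) yields larger gains than later intervention, and interaction across the full lifecycle delivers the best performance.
Interaction reduces extreme failures and improves robustness across models.
Open-weight models such as DeepSeek-V3.2 can outperform higher-capacity models when interaction is used effectively.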
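
One way to read the efficiency trade-off is to relate each model's average-score gain to its added API cost. The sketch below does this for four models using the average score and estimated cost columns of the results table. The "points per extra dollar" ratio and the function name `points_per_extra_dollar` are illustrative assumptions, not a metric from the paper.

```python
# (autonomous average, interactive average, autonomous cost, interactive cost)
results = {
    "GPT-5.1": (75.59, 78.97, 0.473, 0.586),
    "Claude-Sonnet-4.5": (75.51, 80.47, 0.987, 2.220),
    "Grok-4.1-Fast": (67.55, 75.52, 0.192, 0.275),
    "Qwen3-235B": (63.96, 69.67, 0.139, 0.133),
}


def points_per_extra_dollar(auto_avg, inter_avg, auto_cost, inter_cost):
    """Average-score gain per additional dollar; None if cost did not rise."""
    extra_cost = inter_cost - auto_cost
    if extra_cost <= 0:
        return None  # The gain came at no extra cost, so the ratio is undefined.
    return round((inter_avg - auto_avg) / extra_cost, 1)


efficiency = {name: points_per_extra_dollar(*row) for name, row in results.items()}
for name, value in efficiency.items():
    print(name, value)
```

Under this rough measure, similar quality gains can come at very different prices: Claude-Sonnet-4.5 pays far more per point gained than Grok-4.1-Fast, and Qwen3-235B improves while its estimated cost slightly decreases.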

Figure 2: Overview of IDRBench. The benchmark integrates an interactive deep research framework with curated data construction, representative LLMs, and interaction-aware evaluation. It features a multi-agent pipeline for Planning, Research Loop, and Generation, augmented with an interaction mec


This review was generated by AI and reviewed by a human editor.