QUICK REVIEW

[Paper Review] Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Yuhao Zhan, Tianyu Fan|arXiv (Cornell University)|Jan 30, 2026

Adversarial Robustness in Machine Learning | Cited by 0

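One-Line Summary

This paper introduces a process-aware evaluation framework that audits hallucinations in Deep Research Agents (DRAs) across the full planning, search, and summarization trajectory, and proposes the PIES Taxonomy and DeepHalluBench to diagnose the root causes of failure.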

ABSTRACT

Diagnosing the failure mechanisms of Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring critical intermediate hallucinations, such as flawed planning, that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing the full research trajectory. We introduce the PIES Taxonomy to categorize hallucinations along functional components (Planning vs. Summarization) and error properties (Explicit vs. Implicit). We instantiate this taxonomy into a fine-grained evaluation framework that decomposes the trajectory to rigorously quantify these hallucinations. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks including adversarial scenarios, we curate DeepHalluBench. Experiments on six state-of-the-art DRAs reveal that no system achieves robust reliability. Furthermore, our diagnostic analysis traces the etiology of these failures to systemic deficits, specifically hallucination propagation and cognitive biases, providing foundational insights to guide future architectural optimization. Data and code are available at https://github.com/yuhao-zhan/DeepHalluBench.

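Motivation and Objectives

Motivates the need to diagnose hallucinations across the entire research trajectory (planning, search, summarization), not only in the final output.
Proposes a taxonomy for categorizing hallucinations in DRAs, enabling fine-grained auditing.
Builds a benchmark (DeepHalluBench) of hallucination-prone tasks to stress-test DRAs.
Identifies systemic deficits in DRAs that contribute to hallucination, providing guidance for architectural improvement.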

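Proposed Method

Proposes the PIES Taxonomy, which categorizes hallucinations along two axes: functional component (Planning vs. Summarization) and error property (Explicit vs. Implicit).
Instantiates the taxonomy as a fine-grained evaluation framework that decomposes the trajectory into atomic actions, claims, and sub-queries for verification.
Curates a stress-test set (DeepHalluBench) of 100 distinctively hallucination-prone tasks, including adversarial scenarios.
Runs experiments on six state-of-the-art DRAs against the benchmark to evaluate how prone they are to hallucination.
Analyzes the diagnostic results to trace failures to hallucination propagation and cognitive biases, guiding architectural improvement.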
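To make the two-axis labeling concrete, here is a minimal Python sketch of how audit findings could be tagged with the PIES axes named in the abstract (Planning vs. Summarization, Explicit vs. Implicit) and then tallied. It is an illustration only, not the authors' implementation: the names `Component`, `ErrorProperty`, `Finding`, `tally`, and `first_hallucinated_step`, as well as the sample findings, are hypothetical and do not come from the paper or its repository.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """Functional-component axis (names taken from the abstract)."""
    PLANNING = "planning"
    SUMMARIZATION = "summarization"


class ErrorProperty(Enum):
    """Error-property axis (names taken from the abstract)."""
    EXPLICIT = "explicit"
    IMPLICIT = "implicit"


@dataclass(frozen=True)
class Finding:
    """One audited hallucination (hypothetical record layout)."""
    step: int                 # index of the trajectory step that was audited
    component: Component
    error: ErrorProperty


def tally(findings):
    """Count findings per (component, error property) cell."""
    return Counter((f.component, f.error) for f in findings)


def first_hallucinated_step(findings):
    """Earliest step with a finding, or None if the trajectory is clean."""
    return min((f.step for f in findings), default=None)


# Made-up findings for a single trajectory, for illustration only.
findings = [
    Finding(0, Component.PLANNING, ErrorProperty.IMPLICIT),
    Finding(3, Component.SUMMARIZATION, ErrorProperty.EXPLICIT),
    Finding(4, Component.SUMMARIZATION, ErrorProperty.EXPLICIT),
]
counts = tally(findings)
earliest = first_hallucinated_step(findings)
```

Keeping the step index on every finding is what separates a process-aware audit from an end-to-end score: the earliest flagged step shows where an error entered the trajectory, which is the starting point for any propagation analysis.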

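Experimental Results

Research Questions

RQ1: What are the main hallucination failure modes of DRAs across the full research trajectory?
RQ2: Can a process-aware evaluation framework reveal intermediate hallucinations that end-to-end metrics miss?
RQ3: How effective is the PIES Taxonomy at categorizing DRA hallucinations in practice?
RQ4: Which architectural or cognitive biases contribute most to hallucination propagation in DRAs?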

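Key Findings

None of the six state-of-the-art DRAs evaluated achieves robust reliability on the hallucination-prone stress-test set.
The process-aware auditing framework reveals intermediate hallucinations that end-to-end metrics miss.
The PIES Taxonomy decomposes hallucinations by functional component (Planning vs. Summarization) and error property (Explicit vs. Implicit).
Hallucination propagation and cognitive biases are the main drivers of DRA failures.
DeepHalluBench provides a targeted benchmark for diagnosing and comparing hallucination robustness across DRAs.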

This review was generated by AI and checked by a human editor.