QUICK REVIEW

[Paper Digest] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque, Aasar Mehdi|arXiv (Cornell University)|Mar 18, 2026

Topic Modeling | Cited by 0

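One-Sentence Summary

The paper proposes a four-phase, domain-grounded tiered retrieval and verification pipeline (intrinsic verification, adaptive search routing, refined context filtering, extrinsic regeneration) to reduce LLM hallucinations. Evaluated on 650 queries from five benchmarks, it reaches win rates of up to 83.7% over zero-shot baselines, with groundedness scores of 78.8% to 86.4% on factual-answer rows.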

ABSTRACT

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Refined Context Filtering (RCF) to eliminate non-essential or distracting information, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.

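Motivation and Objectives

Reduce hallucinations in LLM output by grounding generation in domain-specific, verified external sources.
Improve factual reliability through a self-regulating, multi-stage retrieval and verification architecture.
Optimize compute through intrinsic verification and early exit, skipping retrieval when possible.
Evaluate the approach on diverse benchmarks to quantify groundedness and error patterns.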

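Proposed Method

A four-phase pipeline implemented in LangGraph: intrinsic verification with early exit; adaptive search routing using a domain detector; refined context filtering to remove noise; and extrinsic regeneration with atomic claim-level verification.
The system first generates a zero-shot answer internally; if confidence is insufficient, it routes the query to trusted domain-specific sources first and then to general web search.
A corrective document grader scores external data for relevance and reliability; the regenerated answer is then split into atomic claims for verification.
Final verification checks each atomic claim against the retrieved evidence, and a circuit breaker returns an apology when verification fails.
Llama 3.1 8B handles internal tasks, the Tavily API handles retrieval, Gemma3 27B serves as the judge, and LangGraph orchestrates the multi-stage graph workflow.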
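
The control flow described above can be sketched in plain Python. This is a minimal illustration, not the authors' LangGraph implementation: every function name, the two thresholds, the apology text, and the rule "fall back to general web search only when the domain search returns nothing" are assumptions made for the example. The model, search, and judge calls are passed in as callables so the sketch stays independent of any particular API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical apology text; the paper's exact wording is not given in this digest.
APOLOGY = "Sorry, I could not verify an answer to this question."


@dataclass
class PipelineResult:
    answer: str
    phase: str  # "intrinsic", "extrinsic", or "circuit_breaker"
    evidence: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    generate: Callable[[str, List[str]], str],      # LLM call: (query, context) -> answer
    confidence: Callable[[str, str], float],        # self-check score in [0, 1]
    detect_domain: Callable[[str], str],            # domain detector
    search: Callable[[str, str], List[str]],        # (query, source) -> documents
    grade: Callable[[str, str], float],             # document grader in [0, 1]
    split_claims: Callable[[str], List[str]],       # answer -> atomic claims
    supported: Callable[[str, List[str]], bool],    # claim vs. evidence check
    conf_threshold: float = 0.8,                    # assumed value
    grade_threshold: float = 0.5,                   # assumed value
) -> PipelineResult:
    # Phase I: intrinsic verification with early exit (no retrieval needed).
    draft = generate(query, [])
    if confidence(query, draft) >= conf_threshold:
        return PipelineResult(draft, "intrinsic")

    # Phase II: adaptive routing, trusted domain source first, then general web.
    docs = search(query, detect_domain(query)) or search(query, "web")

    # Phase III: refined context filtering drops low-scoring documents.
    kept = [d for d in docs if grade(query, d) >= grade_threshold]
    if not kept:
        return PipelineResult(APOLOGY, "circuit_breaker")

    # Phase IV: regenerate from evidence, then verify every atomic claim.
    answer = generate(query, kept)
    claims = split_claims(answer)
    if claims and all(supported(c, kept) for c in claims):
        return PipelineResult(answer, "extrinsic", kept)

    # Circuit breaker: at least one claim lacks support in the evidence.
    return PipelineResult(APOLOGY, "circuit_breaker", kept)
```

Passing the components as callables makes it easy to substitute stubs for testing, or to wrap each phase as a graph node later; the all-claims-must-pass rule is the strictest reading of "atomic claim-level verification" and a softer threshold would be an equally plausible choice.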

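Experimental Results

Research Questions

RQ1: By how much does domain-grounded tiered retrieval reduce hallucinations relative to a zero-shot baseline across different factual tasks?
RQ2: What are the main failure modes of a multi-stage RAG system in a domain-grounded setting, and how can they be mitigated?
RQ3: How does the balance between intrinsic and extrinsic verification affect the latency and accuracy of factual generation?
RQ4: How much does adaptive routing to trusted sources improve groundedness on time-sensitive or numerically precise queries?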

Key Findings

Benchmark	N	Proposed Wins	Tie	Baseline Wins	Win Rate	Hallucination	Groundedness
TimeQA v2	86*	72	10	4	83.7%	13.6%	86.4%
MMLU Global Facts	50	39	8	3	78.0%	33.1%	66.9%
FreshQA v2	150	97	37	16	64.7%	3.5%	19.2%
TruthfulQA	150	82	56	12	54.7%	15.1%	84.9%
HaluEval General	150	75	45	30	50.0%	21.2%	78.8%
Combined (650)	586	365	220	65	65%	-	-

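The pipeline outperformed the zero-shot baseline on all five benchmarks, with win rates between 50.0% and 83.7%.
TimeQA v2 had the highest win rate at 83.7%, followed by MMLU Global Facts at 78.0%.
Groundedness scores stayed between 78.8% and 86.4% across factual-answer rows; MMLU Global Facts (66.9%) was an outlier, attributed to measurement sensitivity.
Groundedness and hallucination metrics were stable on factual rows, but gains were more limited on open-domain benchmarks such as HaluEval General.
A notable failure mode is False-Premise Overclaiming, which points to the need for pre-retrieval answerability checks and improved refusal strategies.
Intrinsic early exit reduced open-domain retrieval usage by about 20%, improving efficiency.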
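
The per-benchmark win rates can be reproduced from the win/tie/loss counts in the table. The short check below assumes that win rate means proposed wins divided by all judged comparisons, so ties count in the denominator; that reading matches every per-benchmark row. The `win_rate` helper is an illustrative name, not something defined in the paper.

```python
# (proposed wins, ties, baseline wins) copied from the results table
rows = {
    "TimeQA v2": (72, 10, 4),
    "MMLU Global Facts": (39, 8, 3),
    "FreshQA v2": (97, 37, 16),
    "TruthfulQA": (82, 56, 12),
    "HaluEval General": (75, 45, 30),
}


def win_rate(wins: int, ties: int, losses: int) -> float:
    """Percentage of judged comparisons won by the proposed pipeline."""
    return round(100 * wins / (wins + ties + losses), 1)


for name, counts in rows.items():
    print(f"{name}: {win_rate(*counts)}%")
```

On HaluEval General, for example, the pipeline wins 75 of 150 comparisons (50.0%) while the baseline wins only 30, so a 50.0% win rate still means the pipeline is ahead once ties are set aside.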


This digest was generated by AI and reviewed by human editors.