QUICK REVIEW

[Paper Review] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque, Aasar Mehdi|arXiv (Cornell University)|Mar 18, 2026
Topic Modeling | Cited by 0
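One-Sentence Summary

The paper proposes a four-phase, domain-grounded tiered retrieval and verification pipeline (Intrinsic Verification, Adaptive Search Routing, Refined Context Filtering, Extrinsic Regeneration) to reduce LLM hallucinations. Evaluated on 650 queries from five benchmarks, it reaches win rates of 50.0% to 83.7% against a zero-shot baseline and groundedness scores of 78.8% to 86.4% on factual-answer rows.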

ABSTRACT

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Refined Context Filtering (RCF) to eliminate non-essential or distracting information, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.

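Motivation and Objectives

  • Reduce hallucinations in LLM output by grounding generation in domain-specific, verified external sources.
  • Improve factual reliability through a self-regulating, multi-stage retrieval and verification architecture.
  • Optimize compute through intrinsic verification with early exit, skipping retrieval when possible.
  • Evaluate the approach on diverse benchmarks to quantify groundedness and error modes.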

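Proposed Method

  • A four-phase pipeline implemented in LangGraph: Intrinsic Verification with early exit; Adaptive Search Routing using a Domain Detector; Refined Context Filtering (RCF) to remove noise; and Extrinsic Regeneration with atomic claim-level verification.
  • The system first generates a zero-shot answer internally; if confidence is insufficient, it routes the query first to trusted domain-specific sources and then to general web search.
  • A corrective document grader scores external data for relevance and reliability; the regenerated answer is then split into atomic claims for verification.
  • Final verification checks each atomic claim against the retrieved evidence, and a circuit breaker returns an apology when verification fails.
  • The implementation uses Llama 3.1 8B for internal tasks, the Tavily API for retrieval, Gemma3 27B as the judge, and LangGraph for the multi-stage graph workflow.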
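The control flow described in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' LangGraph implementation: every callable passed to `run_pipeline` (`generate`, `is_confident`, `detect_domain`, `search`, `is_relevant`, `split_claims`, `claim_supported`) is a hypothetical stand-in for an LLM or search call, and the fallback from domain-specific to general search when the first search returns nothing is one possible reading of the routing step.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical message returned when the circuit breaker trips.
APOLOGY = "Sorry, I could not verify an answer to this question."


@dataclass
class PipelineResult:
    answer: str
    phase: str  # which stage produced the final answer
    evidence: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    generate: Callable[[str, List[str]], str],
    is_confident: Callable[[str, str], bool],
    detect_domain: Callable[[str], str],
    search: Callable[[str, str], List[str]],
    is_relevant: Callable[[str, str], bool],
    split_claims: Callable[[str], List[str]],
    claim_supported: Callable[[str, List[str]], bool],
) -> PipelineResult:
    # Phase I: intrinsic verification; exit early if the draft is trusted.
    draft = generate(query, [])
    if is_confident(query, draft):
        return PipelineResult(draft, "intrinsic")

    # Phase II: adaptive routing; try the detected domain first, then
    # fall back to general web search (assumed fallback condition).
    domain = detect_domain(query)
    docs = search(query, domain) or search(query, "general")

    # Phase III: refined context filtering; drop distracting documents.
    evidence = [doc for doc in docs if is_relevant(query, doc)]

    # Phase IV: regenerate from evidence, then verify each atomic claim.
    answer = generate(query, evidence)
    claims = split_claims(answer)
    verified = bool(evidence) and all(
        claim_supported(claim, evidence) for claim in claims
    )
    if not verified:
        # Circuit breaker: apologize instead of returning unverified text.
        return PipelineResult(APOLOGY, "circuit_breaker", evidence)
    return PipelineResult(answer, "extrinsic", evidence)
```

Passing the stages in as callables keeps the sketch runnable without model or API access; in a graph framework each phase would typically become a node, with the early exit and the circuit breaker expressed as conditional edges.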

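Experimental Results

Research Questions

  • RQ1: How much does domain-grounded tiered retrieval reduce hallucinations relative to a zero-shot baseline across different factual tasks?
  • RQ2: What are the main failure modes of a multi-stage RAG system in a domain-grounded setting, and how can they be mitigated?
  • RQ3: How does the balance between intrinsic and extrinsic verification affect the latency and accuracy of factual generation?
  • RQ4: How much does adaptive routing to trusted sources improve groundedness on time-sensitive or numerically precise queries?

Key Findings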

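| Benchmark         | N   | Proposed Wins | Tie | Baseline Wins | Win Rate | Hallucination | Groundedness |
|-------------------|-----|---------------|-----|---------------|----------|---------------|--------------|
| TimeQA v2         | 86* | 72            | 10  | 4             | 83.7%    | 13.6%         | 86.4%        |
| MMLU Global Facts | 50  | 39            | 8   | 3             | 78.0%    | 33.1%         | 66.9%        |
| FreshQA v2        | 150 | 97            | 37  | 16            | 64.7%    | 3.5%          | 19.2%        |
| TruthfulQA        | 150 | 82            | 56  | 12            | 54.7%    | 15.1%         | 84.9%        |
| HaluEval General  | 150 | 75            | 45  | 30            | 50.0%    | 21.2%         | 78.8%        |
| Combined (650)    | 586 | 365           | 220 | 65            | 65%      | -             | -            |
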
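  • The pipeline outperforms the zero-shot baseline on all five benchmarks, with win rates ranging from 50.0% to 83.7%.
  • TimeQA v2 has the highest win rate at 83.7%, followed by MMLU Global Facts at 78.0%.
  • Groundedness scores stay stable across factual-answer rows, between 78.8% and 86.4%; MMLU is an outlier, attributed to measurement sensitivity.
  • Groundedness and hallucination metrics are stable on factual rows, but open-domain benchmarks such as HaluEval General show more limited gains.
  • A notable failure mode is False-Premise Overclaiming, which points to the need for pre-retrieval answerability checks and better refusal strategies.
  • The intrinsic early exit reduces open-domain retrieval usage by roughly 20%, improving efficiency.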
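The per-benchmark win rates in the table can be recomputed from the win, tie, and loss counts. The sketch below assumes that win rate means proposed wins divided by all evaluated queries (wins plus ties plus baseline wins); the summary does not state the formula, but this definition reproduces the five reported per-benchmark figures. The counts are copied from the table, and the names `RESULTS`, `win_rate`, and `win_rates` are illustrative.

```python
# (benchmark, proposed wins, ties, baseline wins), copied from the table.
RESULTS = [
    ("TimeQA v2", 72, 10, 4),
    ("MMLU Global Facts", 39, 8, 3),
    ("FreshQA v2", 97, 37, 16),
    ("TruthfulQA", 82, 56, 12),
    ("HaluEval General", 75, 45, 30),
]


def win_rate(wins: int, ties: int, losses: int) -> float:
    """Percentage of evaluated queries won by the proposed pipeline.

    Assumed definition: wins / (wins + ties + losses).
    """
    total = wins + ties + losses
    return 100.0 * wins / total


def win_rates() -> dict:
    """Win rate per benchmark, rounded to one decimal place."""
    return {
        name: round(win_rate(wins, ties, losses), 1)
        for name, wins, ties, losses in RESULTS
    }


if __name__ == "__main__":
    for name, rate in win_rates().items():
        print(f"{name}: {rate}%")
```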


This review was generated by AI and reviewed by a human editor.