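[Paper Digest] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval
Proposes a four-phase, domain-grounded tiered retrieval and verification pipeline (intrinsic verification, adaptive search routing, refined context filtering, extrinsic regeneration) to reduce LLM hallucinations. Evaluated on 650 queries from five benchmarks, it reports higher win rates than a zero-shot baseline and strong groundedness scores.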
Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Refined Context Filtering (RCF) to eliminate non-essential or distracting information, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.
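Motivation and Goals
- Reduce hallucinations in LLM output by grounding generation in verified, domain-specific external sources.
- Improve factual reliability through a self-regulating, multi-stage retrieval and verification architecture.
- Save compute through intrinsic verification with early exit, skipping retrieval whenever possible.
- Evaluate the approach on diverse benchmarks to quantify groundedness and error modes.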
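Proposed Method
- A four-phase pipeline implemented in LangGraph: intrinsic verification with early exit; adaptive search routing driven by a domain detector; refined context filtering (RCF) to remove noise; and extrinsic regeneration with atomic claim-level verification.
- The system first generates a zero-shot answer internally. If confidence is insufficient, it routes the query to trusted domain-specific sources before falling back to general web search.
- A corrective document grader scores external data for relevance and reliability; the regenerated answer is then split into atomic claims for verification.
- Final verification checks each atomic claim against the retrieved evidence, and a circuit breaker returns an apology when verification fails.
- Components: Llama 3.1 8B for internal tasks, the Tavily API for retrieval, Gemma3 27B as the judge, and LangGraph for the multi-stage graph workflow.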
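The control flow of the four phases can be sketched in plain Python. This is a minimal illustration rather than the authors' LangGraph implementation: every name here (`PipelineDeps`, `answer`, the callables, the `0.8` threshold) is hypothetical, the LLM and search calls are injected as stand-in functions, and two behaviours are assumptions not stated in the source: falling back to web search only when the domain search returns nothing, and apologizing when filtering leaves no evidence.

```python
from dataclasses import dataclass
from typing import Callable, List

APOLOGY = "Sorry, I could not verify an answer to this question."


@dataclass
class PipelineDeps:
    """Hypothetical stand-ins for the LLM, search, and judge calls."""
    generate: Callable[[str, List[str]], str]       # (query, context) -> answer
    confidence: Callable[[str, str], float]         # (query, answer) -> score in [0, 1]
    detect_domain: Callable[[str], str]             # query -> domain label
    search_domain: Callable[[str, str], List[str]]  # (query, domain) -> documents
    search_web: Callable[[str], List[str]]          # query -> documents
    grade: Callable[[str, str], bool]               # (query, document) -> keep it?
    split_claims: Callable[[str], List[str]]        # answer -> atomic claims
    supported: Callable[[str, List[str]], bool]     # (claim, evidence) -> verified?
    threshold: float = 0.8                          # assumed early-exit cutoff


def answer(query: str, deps: PipelineDeps) -> str:
    # Phase I: intrinsic verification with early exit (no retrieval needed).
    draft = deps.generate(query, [])
    if deps.confidence(query, draft) >= deps.threshold:
        return draft

    # Phase II: adaptive routing, domain-specific sources before general web.
    domain = deps.detect_domain(query)
    docs = deps.search_domain(query, domain) or deps.search_web(query)

    # Phase III: refined context filtering drops irrelevant documents.
    evidence = [doc for doc in docs if deps.grade(query, doc)]
    if not evidence:
        return APOLOGY  # assumption: nothing left to ground the answer on

    # Phase IV: extrinsic regeneration, then atomic claim-level verification.
    regenerated = deps.generate(query, evidence)
    claims = deps.split_claims(regenerated)
    if claims and all(deps.supported(claim, evidence) for claim in claims):
        return regenerated
    return APOLOGY  # circuit breaker: at least one claim failed verification
```

Injecting the callables keeps the sketch runnable without any model or API key; in a real system each one would wrap a model or search call.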
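Experimental Results
Research Questions
- RQ1: How much does domain-grounded tiered retrieval reduce hallucinations relative to a zero-shot baseline across different factual tasks?
- RQ2: What are the main failure modes of a multi-stage RAG system in a domain-grounded setting, and how can they be mitigated?
- RQ3: How does the balance between intrinsic and extrinsic verification affect the latency and accuracy of factual generation?
- RQ4: How much does adaptive routing to trusted sources improve groundedness on time-sensitive or numerically precise queries?
Key Findings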
| Benchmark | N | Proposed Wins | Tie | Baseline Wins | Win Rate | Hallucination | Groundedness |
|---|---|---|---|---|---|---|---|
| TimeQA v2 | 86* | 72 | 10 | 4 | 83.7% | 13.6% | 86.4% |
| MMLU Global Facts | 50 | 39 | 8 | 3 | 78.0% | 33.1% | 66.9% |
| FreshQA v2 | 150 | 97 | 37 | 16 | 64.7% | 3.5% | 19.2% |
| TruthfulQA | 150 | 82 | 56 | 12 | 54.7% | 15.1% | 84.9% |
| HaluEval General | 150 | 75 | 45 | 30 | 50.0% | 21.2% | 78.8% |
| Combined (650) | 586 | 365 | 220 | 65 | 65% | - | - |
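- The pipeline outperforms the zero-shot baseline on all five benchmarks, with win rates ranging from 50.0% to 83.7%.
- TimeQA v2 has the highest win rate at 83.7%, followed by MMLU Global Facts at 78.0%.
- Groundedness scores stay between 78.8% and 86.4% across factual-answer rows; MMLU Global Facts is an outlier, attributed to measurement sensitivity.
- Groundedness and hallucination metrics are stable on factual rows, but open-domain benchmarks such as HaluEval General show more limited gains.
- A notable failure mode is False-Premise Overclaiming, which points to the need for pre-retrieval answerability checks and better refusal strategies.
- The intrinsic early exit cut open-domain retrieval usage by roughly 20%, improving efficiency.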
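The per-benchmark win rates in the table can be reproduced from the raw counts. The sketch below assumes the Win Rate column is proposed wins divided by N, with ties counted in the denominator; that definition is inferred from the table rather than stated in the source, and the names `RESULTS` and `win_rate` are illustrative.

```python
# Counts copied from the results table: (N, proposed wins, ties, baseline wins).
RESULTS = {
    "TimeQA v2": (86, 72, 10, 4),
    "MMLU Global Facts": (50, 39, 8, 3),
    "FreshQA v2": (150, 97, 37, 16),
    "TruthfulQA": (150, 82, 56, 12),
    "HaluEval General": (150, 75, 45, 30),
}


def win_rate(n: int, wins: int) -> float:
    """Win rate in percent, rounded to one decimal; ties count against it."""
    return round(100 * wins / n, 1)


for name, (n, wins, ties, losses) in RESULTS.items():
    # Each row's outcomes should account for every evaluated query.
    assert wins + ties + losses == n
    print(f"{name}: {win_rate(n, wins)}%")
```

Under this definition a 50.0% win rate, as on HaluEval General, still means the pipeline won more often than the baseline (75 versus 30), because the remaining 45 comparisons were ties.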
This digest was generated by AI and reviewed by a human editor.