QUICK REVIEW

[Paper Review] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque, Aasar Mehdi|arXiv (Cornell University)|2026. 03. 18.

Topic Modeling · Citations: 0

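One-Line Summary

Proposes a four-phase, domain-grounded tiered retrieval and verification pipeline (Intrinsic Verification, Adaptive Search Routing, Refined Context Filtering, Extrinsic Regeneration) to reduce LLM hallucinations; evaluated on 650 queries from five benchmarks, it reaches win rates of 50.0% to 83.7% over a zero-shot baseline and groundedness of 78.8% to 86.4% on factual-answer rows.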

ABSTRACT

Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Refined Context Filtering (RCF) to eliminate non-essential or distracting information, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.

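Motivation and Objectives

Reduce hallucinations in LLM output by grounding answers in verified, domain-specific external sources.
Improve factual reliability through a multi-stage retrieval and verification architecture.
Reduce compute where possible through intrinsic verification and early exit.
Evaluate the approach on diverse benchmarks to quantify groundedness and failure modes.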

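Proposed Method

A four-phase pipeline implemented in LangGraph: intrinsic verification with early exit; adaptive search routing via a domain detector; refined context filtering to remove noise; and extrinsic regeneration followed by atomic claim-level verification.
A zero-shot internal answer is generated first; if its confidence is insufficient, the system routes the query to trusted domain sources before falling back to general web search.
Retrieved documents are scored for relevance and reliability by a corrective document grader, and the regenerated answer is decomposed into atomic claims for verification.
Final verification checks each atomic claim against the retrieved evidence, and a circuit breaker returns an apology when verification fails.
The implementation uses Llama 3.1 8B for internal tasks, the Tavily API for search, Gemma3 27B as the judge, and LangGraph for the multi-step graph workflow.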
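
The control flow described above can be sketched in plain Python. This is a minimal illustration, not the authors' LangGraph implementation: the function names, the confidence threshold, the relevance cutoff, the trusted-source table, the apology text, and the fallback to general search when trusted sources return nothing are all assumptions made for the example. The model, search, and judge calls are passed in as callables so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical values; the review does not report the paper's actual settings.
CONFIDENCE_THRESHOLD = 0.8
TRUSTED_SOURCES = {
    "medicine": ["trusted-medical-archive.example"],
    "history": ["trusted-history-archive.example"],
}
APOLOGY = "Sorry, I could not verify an answer to this question."


@dataclass
class Doc:
    text: str
    relevance: float  # score a document grader would assign, in [0, 1]


@dataclass
class Result:
    answer: str
    used_retrieval: bool
    claims: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    draft: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    detect_domain: Callable[[str], str],
    search: Callable[[str, Optional[List[str]]], List[Doc]],
    regenerate: Callable[[str, List[Doc]], str],
    split_claims: Callable[[str], List[str]],
    supported: Callable[[str, List[Doc]], bool],
    min_relevance: float = 0.5,  # assumed cutoff for context filtering
) -> Result:
    # Phase I: intrinsic verification with early exit.
    answer, confidence = draft(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Result(answer, used_retrieval=False)

    # Phase II: adaptive routing, trusted domain sources before general web.
    domain = detect_domain(query)
    docs = search(query, TRUSTED_SOURCES.get(domain))
    if not docs:
        docs = search(query, None)  # assumed fallback to general web search

    # Phase III: context filtering drops low-relevance documents.
    kept = [d for d in docs if d.relevance >= min_relevance]

    # Phase IV: regenerate from evidence, then verify each atomic claim.
    answer = regenerate(query, kept)
    claims = split_claims(answer)
    if not kept or not claims or not all(supported(c, kept) for c in claims):
        # Circuit breaker: return an apology instead of an unverified answer.
        return Result(APOLOGY, used_retrieval=True, claims=claims)
    return Result(answer, used_retrieval=True, claims=claims)
```

Passing the components in as callables keeps the four phases visible in one function; in a graph framework each phase would typically become a node, with the early exit and the circuit breaker expressed as conditional edges.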

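Experimental Results

Research Questions

RQ1: How much can domain-grounded tiered retrieval reduce hallucinations across diverse factual tasks compared with a zero-shot baseline?
RQ2: What are the main failure modes of a multi-stage RAG system in a domain-grounded setting, and how can they be mitigated?
RQ3: How does the balance between intrinsic and extrinsic verification affect the latency and accuracy of factual generation?
RQ4: How much does adaptive routing to trusted sources contribute to grounding time-sensitive or numerically precise queries?

Key Results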

Benchmark	N	Proposed Wins	Tie	Baseline Wins	Win Rate	Hallucination	Groundedness
TimeQA v2	86*	72	10	4	83.7%	13.6%	86.4%
MMLU Global Facts	50	39	8	3	78.0%	33.1%	66.9%
FreshQA v2	150	97	37	16	64.7%	3.5%	19.2%
TruthfulQA	150	82	56	12	54.7%	15.1%	84.9%
HaluEval General	150	75	45	30	50.0%	21.2%	78.8%
Combined (650)	586	365	220	65	65%	-	-
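
The per-benchmark win rates in the table can be rechecked from the raw counts. The short script below assumes that win rate means proposed wins divided by N, which is consistent with every per-benchmark row; the helper name `win_rate` is introduced here for illustration only.

```python
# (benchmark, N, proposed wins, ties, baseline wins), copied from the table.
ROWS = [
    ("TimeQA v2", 86, 72, 10, 4),
    ("MMLU Global Facts", 50, 39, 8, 3),
    ("FreshQA v2", 150, 97, 37, 16),
    ("TruthfulQA", 150, 82, 56, 12),
    ("HaluEval General", 150, 75, 45, 30),
]


def win_rate(wins: int, n: int) -> float:
    """Win rate as a percentage, rounded to one decimal place."""
    return round(100 * wins / n, 1)


for name, n, wins, ties, losses in ROWS:
    # Each row's three outcomes should account for all N comparisons.
    assert wins + ties + losses == n
    print(f"{name}: {win_rate(wins, n)}%")

# Column sums over the five benchmarks: N, proposed wins, ties, baseline wins.
totals = [sum(col) for col in zip(*(row[1:] for row in ROWS))]
print("column sums:", totals)
```

The recomputed rates match the table (83.7, 78.0, 64.7, 54.7, 50.0). The column sums come to 586, 365, 156, and 65, so the per-benchmark ties add up to 156 rather than the 220 shown in the Combined row, and 365 wins out of 586 is about 62.3% rather than 65%; the review does not explain these differences.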

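The pipeline outperformed the zero-shot baseline on all five benchmarks, with win rates ranging from 50.0% to 83.7%.
TimeQA v2 had the highest win rate at 83.7%, followed by MMLU Global Facts at 78.0%.
Groundedness stayed between 78.8% and 86.4% on factual-answer rows; MMLU was an exception, attributed to measurement sensitivity.
Groundedness and hallucination metrics were stable on factual rows, but gains were limited on open-domain benchmarks such as HaluEval General.
The most notable failure mode was False-Premise Overclaiming, which points to the need for pre-retrieval answerability checks and better refusal strategies.
Intrinsic halting reduced retrieval use on open-domain queries by about 20%, improving efficiency.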

This review was generated by AI and reviewed by a human editor.