QUICK REVIEW

[Paper Review] RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Aswini Sivakumar, Vijayan Sugumaran|arXiv (Cornell University)|Mar 3, 2026
Topic Modeling | Citations: 0
One-Line Summary

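TL;DR: RAG-X presents a diagnostic framework that separates retrieval from generation in medical RAG systems. It uses Context Utilization Efficiency (CUE) to show whether an answer is grounded in evidence or only deceptively accurate, exposing the "Accuracy Fallacy."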

ABSTRACT

Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches fail to diagnose whether an error stems from faulty retrieval or flawed generation, limiting developers from performing targeted improvement. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an "Accuracy Fallacy", where a 14% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.

Motivation and Objectives

  • Motivate safe deployment of medical RAG by diagnosing component-level failures.
  • Decouple evaluation of retriever and generator to identify grounding vs. generation errors.
  • Introduce Context Utilization Efficiency (CUE) to categorize outputs into actionable diagnostic quadrants.
  • Provide diagnostics across information extraction, short-answer generation, and MCQ answering using medical datasets.

Proposed Method

  • Extend the standard RAG pipeline with a medical normalization layer for preprocessing.
  • Use a hybrid retrieval approach that combines BM25 lexical matching with semantic vector search; a weighting parameter α controls the balance between sparse and dense retrieval.
  • Define and compute retrieval diagnostics (ranking metrics, LLM-based context relevancy, fine-grained retrieval signals).
  • Define and compute generation diagnostics (surface-level similarity, semantic similarity, structured output measures, and LLM-based judgment).
  • Introduce Context Utilization Efficiency (CUE) to map retriever and generator performance into four diagnostic quadrants (Effective Use, Information Blindness, Hallucination/Lucky Guess, Correct Rejection).
  • Evaluate on three medical QA benchmarks with diverse modalities and associated knowledge bases.
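The α-weighted hybrid retrieval step can be illustrated with a minimal sketch. The summary does not give the fusion formula, so the convex combination below, the min-max normalization, the convention that α = 1 means purely dense retrieval, and the names `min_max_normalize`, `hybrid_scores`, and `rank` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map every doc to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float], dense: Dict[str, float], alpha: float
) -> Dict[str, float]:
    """Fuse sparse and dense scores per document.

    Assumed convention (not stated in the source): alpha = 0 is pure BM25,
    alpha = 1 is pure dense retrieval.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    sparse_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    # A document missing from one retriever contributes 0.0 for that side.
    docs = set(sparse_n) | set(dense_n)
    return {
        doc: alpha * dense_n.get(doc, 0.0) + (1.0 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }


def rank(scores: Dict[str, float]) -> List[str]:
    """Order documents by descending fused score, breaking ties by doc id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical raw scores for three passages.
bm25_raw = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_raw = {"a": 0.1, "b": 0.9, "c": 0.5}
ranking = rank(hybrid_scores(bm25_raw, dense_raw, alpha=0.5))
```

Normalizing before mixing matters because BM25 scores are unbounded while similarity scores from a vector index usually fall in a narrow range; without it, α would not behave as an interpretable balance.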

Experimental Results

Research Questions

  • RQ1: Can RAG-X accurately diagnose whether errors stem from retrieval or generation in medical QA tasks?
  • RQ2: Do CUE quadrants reveal hidden grounding issues not captured by aggregate accuracy metrics?
  • RQ3: How do different retriever configurations impact coverage, redundancy, and exclusivity of retrieved evidence in medical domains?
  • RQ4: What actionable bottlenecks emerge in information extraction, short-answer generation, and MCQ answering under RAG in medicine?

Key Findings

  • RAG-X reveals an Accuracy Fallacy, where high overall accuracy masks a lack of evidence-based grounding.
  • On the best-performing pipelines, 22.0% of retrieved contexts are redundant and 6.8% of top-ranked contexts are the exclusive source of evidence.
  • There is a 14% gap between accuracy and evidence-based grounding, with 33.9% of responses supported only by “lucky guesses” rather than by retrieved evidence.
  • Context Utilization Efficiency (CUE) categorizes outputs into four quadrants, exposing grounded successes and non-grounded yet correct-looking answers.
  • Across three clinical datasets, standard accuracy/F1 metrics can misrepresent real grounding and retrieval quality without component-level diagnostics.
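The quadrant idea behind CUE, and the way it can separate raw accuracy from grounded accuracy, can be sketched as follows. This is one plausible reading, not the paper's definition: the summary names the four quadrants but not the rule that assigns them, so the 2×2 mapping over two boolean judgments, the treatment of "Correct Rejection" as the no-evidence, no-correct-answer cell, and the names `Outcome`, `cue_quadrant`, and `summarize` are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Outcome:
    """Per-question judgments (assumed to come from upstream evaluators)."""
    context_relevant: bool  # did retrieval surface supporting evidence?
    answer_correct: bool    # was the generated answer judged correct?


def cue_quadrant(outcome: Outcome) -> str:
    """Map one outcome to a quadrant name (assumed 2x2 mapping)."""
    if outcome.context_relevant and outcome.answer_correct:
        return "Effective Use"
    if outcome.context_relevant:
        return "Information Blindness"
    if outcome.answer_correct:
        return "Hallucination/Lucky Guess"
    # Assumption: no evidence and no correct answer, e.g. the model abstained.
    return "Correct Rejection"


def summarize(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Return quadrant shares plus accuracy, grounded accuracy, and their gap."""
    items = list(outcomes)
    if not items:
        raise ValueError("at least one outcome is required")
    counts = Counter(cue_quadrant(o) for o in items)
    total = len(items)
    report = {name: counts[name] / total for name in (
        "Effective Use", "Information Blindness",
        "Hallucination/Lucky Guess", "Correct Rejection")}
    # Accuracy counts every correct answer; grounded accuracy only those
    # that were backed by relevant retrieved context.
    accuracy = sum(o.answer_correct for o in items) / total
    report["accuracy"] = accuracy
    report["grounded_accuracy"] = report["Effective Use"]
    report["gap"] = accuracy - report["grounded_accuracy"]
    return report


# Hypothetical toy data: 5 grounded hits, 2 lucky guesses, 2 misses despite
# good context, 1 case with neither evidence nor a correct answer.
toy = ([Outcome(True, True)] * 5 + [Outcome(False, True)] * 2
       + [Outcome(True, False)] * 2 + [Outcome(False, False)])
report = summarize(toy)
```

Under these assumptions the gap equals the share of correct answers that fall in the "Hallucination/Lucky Guess" quadrant, which is the kind of discrepancy an aggregate accuracy score alone would hide.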

This review was generated by AI and checked by a human editor.