QUICK REVIEW

[Paper Review] Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge

Mohammad R. Rezaei, Reza Saadati Fard|ArXiv.org|Feb 18, 2025

Topic Modeling | Cited by 3

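One-Sentence Summary

AMG-RAG automates the construction and continuous updating of a medical knowledge graph to support LLM-based medical question answering, achieving strong results on MEDQA and MEDMCQA with a comparatively small model.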

ABSTRACT

Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Agentic Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.

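Research Motivation and Goals

Keep medical question answering current as medical knowledge evolves rapidly.
Automate the construction and continuous updating of a medical knowledge graph (MKG).
Combine the MKG with RAG and chain-of-thought (CoT) reasoning to improve medical question-answering performance.
Evaluate the framework on the MEDQA and MEDMCQA benchmarks.
Demonstrate efficiency without increasing inference overhead.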

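Proposed Method

Propose AMG-RAG: an iterative pipeline that builds the MKG using LLM agents and medical retrieval tools.
Represent medical terms as KG nodes and infer the relations between them with confidence scores.
Explore the KG with BFS/DFS under a confidence threshold, generating a chain of thought for each entity.
Combine MKG reasoning with RAG and external evidence retrieval (PubMedSearch, WikiSearch).
Evaluate on MEDQA (F1) and MEDMCQA (accuracy) against larger models.
Store the MKG in Neo4j, with bidirectional relations that carry confidence scores.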
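
The traversal step listed above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the `MedicalKG` class, the `bfs` and `dfs` helpers, the `max_depth` parameter, the relation names, and every confidence value below are assumptions made for illustration, and a plain Python dictionary stands in for Neo4j.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MedicalKG:
    """Toy graph (hypothetical): node -> list of (neighbor, relation, confidence)."""
    edges: dict = field(default_factory=dict)

    def add_relation(self, src, relation, dst, confidence, inverse=None):
        # Store the forward edge; if an inverse label is given, also store
        # the reverse edge so the relation can be followed in both directions.
        self.edges.setdefault(src, []).append((dst, relation, confidence))
        self.edges.setdefault(dst, [])
        if inverse is not None:
            self.edges[dst].append((src, inverse, confidence))

    def neighbors(self, node, threshold):
        # Keep only edges whose confidence meets the threshold.
        return [e for e in self.edges.get(node, []) if e[2] >= threshold]


def bfs(kg, start, threshold, max_depth):
    """Breadth-first exploration; returns nodes in the order they are reached."""
    visited, order = {start}, [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for dst, _rel, _conf in kg.neighbors(node, threshold):
            if dst not in visited:
                visited.add(dst)
                order.append(dst)
                queue.append((dst, depth + 1))
    return order


def dfs(kg, start, threshold, max_depth, visited=None, depth=0):
    """Depth-first exploration; returns nodes in the order they are reached."""
    if visited is None:
        visited = []
    visited.append(start)
    if depth < max_depth:
        for dst, _rel, _conf in kg.neighbors(start, threshold):
            if dst not in visited:
                dfs(kg, dst, threshold, max_depth, visited, depth + 1)
    return visited


# Illustrative edges only; the confidence values are made up.
kg = MedicalKG()
kg.add_relation("metformin", "treats", "type 2 diabetes", 0.9, inverse="treated_by")
kg.add_relation("type 2 diabetes", "risk_factor_for", "neuropathy", 0.8, inverse="associated_with")
kg.add_relation("metformin", "may_cause", "lactic acidosis", 0.4, inverse="caused_by")
kg.add_relation("neuropathy", "has_symptom", "numbness", 0.7, inverse="symptom_of")

print(bfs(kg, "metformin", threshold=0.5, max_depth=2))
print(dfs(kg, "metformin", threshold=0.5, max_depth=3))
```

With a threshold of 0.5, the low-confidence edge to "lactic acidosis" is never followed, so that node does not appear in either traversal. In a fuller pipeline, each visited entity would be passed to the LLM to produce its chain of thought; that step is omitted here.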

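Experimental Results

Research Questions

RQ1: How does automated MKG construction and dynamic updating improve the accuracy and reliability of medical question answering?
RQ2: What effect does integrating CoT reasoning and external retrieval into KG-based question answering have on standard medical benchmarks?
RQ3: With the support of an MKG and retrieval tools, can a smaller model (≈8B parameters) outperform larger models on MEDQA and MEDMCQA?
RQ4: How do confidence scores and graph traversal strategies affect answer quality and interpretability?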

Key Findings

(Table 1) MEDQA models
Model	Model Size	Acc. (%)	F1 (%)	Fine-Tuned	Uses CoT	Uses Search
Med-Gemini	≈1800B	91.1	89.5	✓	✓	✓
GPT-4	≈1760B	90.2	88.7	✓	✓	✓
Med-PaLM 2	≈340B	85.4	82.1	✓	✓	✗
Med-PaLM 2 (5-shot)	≈340B	79.7	75.3	✗	✓	✗
AMG-RAG	≈8B	73.9	74.1	✗	✓	✓
Meerkat	≈7B	74.3	70.4	✓	✓	✗
Meditron	≈70B	70.2	68.3	✓	✓	✓
Flan-PaLM	≈540B	67.6	65.0	✓	✓	✗
LLAMA-2	≈70B	61.5	60.2	✓	✓	✗
Shakti-LLM	≈2.5B	60.3	58.9	✓	✗	✗
Codex 5-shot CoT	–	60.2	57.7	✗	✓	✓
BioMedGPT	≈10B	50.4	48.7	✓	✗	✗
BioLinkBERT (base)	–	40.0	38.4	✓	✗	✗

(Table 2) MedMCQA models
Model	Model Size	Acc. (%)
AMG-RAG	≈8B	66.34
Meditron (70B)	≈70B	66.0
Codex 5-shot	–	59.7
VOD	–	58.3
Flan-PaLM	≈540B	57.6
PaLM	≈540B	54.5
GAL	≈120B	52.9
PubMedBERT	–	40.0
SciBERT	–	39.0
BioBERT	–	38.0
BERT	–	35.0

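AMG-RAG achieves an F1 of 74.1% on MEDQA and an accuracy of 66.34% on MEDMCQA, outperforming comparable models as well as models 10 to 100 times larger.
At ≈8B parameters, AMG-RAG matches or exceeds several larger baselines without fine-tuning or added inference cost.
Adding PubMedSearch and WikiSearch improves performance; in the MEDQA experiments, PubMedSearch outperforms WikiSearch.
Removing CoT or KG integration substantially lowers accuracy and F1, underscoring the importance of structured reasoning and domain-specific retrieval.
The MKG can be built dynamically per query and updated with external evidence, enabling up-to-date, domain-specific reasoning.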
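
To make the size comparison concrete, the short sketch below lists the MEDQA baselines that are both larger than AMG-RAG and lower in F1, with their approximate size ratios. The figures are copied from the MEDQA rows of the results table above; the sizes are the table's own approximations, and the function name `outperformed_larger` is hypothetical.

```python
# (model, approx. size in billions of parameters, MEDQA F1 in %),
# copied from the MEDQA rows of the results table; sizes are approximate.
baselines = [
    ("Meerkat", 7, 70.4),
    ("Meditron", 70, 68.3),
    ("Flan-PaLM", 540, 65.0),
    ("LLAMA-2", 70, 60.2),
    ("Shakti-LLM", 2.5, 58.9),
    ("BioMedGPT", 10, 48.7),
]
AMG_SIZE_B, AMG_F1 = 8, 74.1


def outperformed_larger(rows, size_b, f1):
    """Return (model, size ratio) for rows that are larger yet score a lower F1."""
    return [
        (name, round(s / size_b, 2))
        for name, s, score in rows
        if s > size_b and score < f1
    ]


for name, ratio in outperformed_larger(baselines, AMG_SIZE_B, AMG_F1):
    print(f"{name}: about {ratio}x the size of AMG-RAG")
```

On these numbers, the larger baselines that AMG-RAG beats on F1 range from roughly 1.25x (BioMedGPT) to 67.5x (Flan-PaLM) its size. The far larger Med-Gemini, GPT-4, and Med-PaLM 2 still score higher.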

This review was generated by AI and reviewed by human editors.