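[Paper Review] RAG vs. GraphRAG: A Systematic Evaluation and Key Insights
The paper systematically compares RAG and GraphRAG on general text tasks (question answering and query-based summarization), shows that the two have complementary strengths, and proposes selection and integration strategies that combine them.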
Retrieval-Augmented Generation (RAG) improves large language models (LLMs) by retrieving relevant information from external sources and has been widely adopted for text-based tasks. For structured data, such as knowledge graphs, Graph Retrieval-Augmented Generation (GraphRAG) retrieves and aggregates information along graph structures. More recently, GraphRAG has been extended to general text settings by organizing unstructured text into graph representations, showing promise for reasoning and grounding. Despite these advances, existing GraphRAG systems for text data are often tailored to specific tasks, datasets, and system designs, resulting in heterogeneous evaluation protocols. Consequently, a systematic understanding of the relative strengths, limitations, and trade-offs between RAG and GraphRAG on widely used text benchmarks remains limited. In this paper, we present a comprehensive benchmark study comparing RAG and GraphRAG on established text-based tasks, including question answering and query-based summarization. We introduce a unified evaluation protocol that standardizes data preprocessing, retrieval configurations, and generation settings, enabling fair and reproducible comparisons. Our results highlight the distinct strengths of RAG and GraphRAG across different tasks and evaluation perspectives. Building on these findings, we explore selection and integration strategies that combine the strengths of both paradigms, leading to consistent performance improvements. We further analyze failure modes, efficiency trade-offs, and evaluation biases, and highlight key considerations for designing and evaluating retrieval-augmented generation systems.
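Research Motivation and Goals
- Evaluate RAG and GraphRAG on widely used text-based question answering and query-based summarization benchmarks.
- Analyze the strengths, weaknesses, and task-dependent performance differences of RAG and GraphRAG.
- Explore strategies that combine the two approaches to improve downstream task performance.
- Offer insights into the limitations of current GraphRAG and directions for future work.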
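Proposed Method
- Use a representative semantic-similarity-based RAG baseline with 256-token chunks, top-10 retrieval, and text-embedding-ada-002 embeddings.
- Implement two GraphRAG baselines: KG-based GraphRAG, which relies on triple extraction, and Community-based GraphRAG, which offers local and global retrieval over hierarchical communities.
- Evaluate with standard metrics on question answering (single-hop/multi-hop, single-document/multi-document) and query-based summarization (single-document/multi-document).
- Use the same chunking, embeddings, and LLM across all methods to ensure a fair comparison.
- Evaluate two combination strategies: Selection (route each query to either RAG or GraphRAG) and Integration (retrieve jointly with both).
- Analyze evaluation bias in the LLM-as-a-judge setting used for summarization.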
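The retrieval baseline and the two combination strategies listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: whitespace tokens stand in for a real tokenizer, a bag-of-words vector stands in for text-embedding-ada-002, and the names `chunk_tokens`, `top_k`, `select`, `integrate`, and the `is_multi_hop` routing predicate are all hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List


def chunk_tokens(text: str, size: int = 256) -> List[str]:
    """Split text into chunks of at most `size` whitespace tokens.

    Assumption: whitespace tokens approximate the tokenizer's tokens.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 10) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# A retriever maps a query to a list of context passages.
Retriever = Callable[[str], List[str]]


def select(query: str, rag: Retriever, graph_rag: Retriever,
           is_multi_hop: Callable[[str], bool]) -> List[str]:
    """Selection: route the query to exactly one retriever.

    `is_multi_hop` is a hypothetical router; any query classifier fits here.
    """
    return graph_rag(query) if is_multi_hop(query) else rag(query)


def integrate(query: str, rag: Retriever, graph_rag: Retriever) -> List[str]:
    """Integration: call both retrievers and merge, dropping duplicates."""
    seen = set()
    merged = []
    for passage in rag(query) + graph_rag(query):
        if passage not in seen:
            seen.add(passage)
            merged.append(passage)
    return merged
```

In this sketch Selection runs one retriever per query, while Integration always runs both, which is where the extra computational cost of Integration comes from.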

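Experimental Results
Research Questions
- RQ1: What are the relative strengths of RAG and GraphRAG on general text-based question answering and summarization benchmarks?
- RQ2: In which scenarios (single-hop vs. multi-hop, single-document vs. multi-document) does each approach excel or fall short?
- RQ3: Can we design strategies that exploit the complementary strengths of RAG and GraphRAG to improve performance?
- RQ4: What are the limitations of, and future directions for, applying GraphRAG to text tasks?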
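Key Findings
- RAG performs well on detail-oriented single-hop queries and on tasks that require explicit factual details.
- GraphRAG (especially Community-GraphRAG Local) stands out on multi-hop reasoning tasks.
- Community-GraphRAG's global retrieval often underperforms in question answering and may hallucinate, although it can help on comparison and temporal queries.
- KG-based GraphRAG underperforms because the graph is incomplete (only about 65% of answer entities are present in the KG).
- Selection and Integration strategies generally improve question answering performance; Integration yields larger gains at a higher computational cost.
- For query-based summarization, RAG generally performs well, KG-GraphRAG benefits from using triples plus text, and Community-GraphRAG's local retrieval is advantageous; global retrieval focuses on corpus-level summaries and its results fluctuate.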
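The KG incompleteness finding above rests on a coverage statistic: the share of gold answer entities that appear anywhere in the extracted graph. The sketch below shows one way such a number could be computed. It assumes exact string matching after lowercasing and whitespace normalization, which may differ from the matching procedure used in the paper; `answer_coverage` and the other names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# A knowledge-graph triple: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def kg_entities(triples: Iterable[Triple]) -> Set[str]:
    """Collect every head and tail entity that appears in the graph."""
    entities = set()
    for head, _relation, tail in triples:
        entities.add(normalize(head))
        entities.add(normalize(tail))
    return entities


def answer_coverage(triples: Iterable[Triple], answers: List[str]) -> float:
    """Fraction of answer strings that exist as entities in the graph.

    Assumption: exact match after normalization; returns 0.0 for no answers.
    """
    if not answers:
        return 0.0
    entities = kg_entities(triples)
    hits = sum(1 for answer in answers if normalize(answer) in entities)
    return hits / len(answers)
```

A coverage well below 1.0 means that some questions cannot be answered from the triples alone, no matter how good the graph retrieval is.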

This review was generated by AI and reviewed by human editors.