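[Paper Summary] Knowledge Graph Prompting for Multi-Document Question Answering
This paper proposes Knowledge Graph Prompting (KGP) for multi-document question answering. It builds a knowledge graph over passages and document structures, and uses an LLM-based traversal agent to retrieve the contextual evidence needed to answer questions across documents.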
The "pre-train, prompt, predict" paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in the scenario of multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of different documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or intra-document structural relations. For graph traversal, we design an LLM-based graph traversal agent that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the graph traversal agent acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA.
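Research Motivation and Objectives
- Motivate MD-QA as a task that goes beyond open-domain QA, because it requires reasoning across documents and understanding structured content.
- Propose a generally applicable KG construction method that encodes lexical/semantic similarity and document-structure relations.
- Develop an LLM-guided graph traversal agent that adaptively retrieves relevant context.
- Show that graph-based prompting improves MD-QA performance and retrieval efficiency on multiple datasets.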
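Proposed Method
- Build a knowledge graph whose nodes are passages or document structures (pages/tables) and whose edges encode lexical/semantic similarity or structural relations.
- Add structural nodes (pages, tables) to the graph, and represent table content as Markdown to help the LLM interpret it.
- Train or fine-tune an LLM-based graph traversal agent that, given the passages visited so far, selects the best neighboring node to visit next in order to approach the answer.
- Apply instruction fine-tuning to strengthen the traversal agent's reasoning and thereby reduce hallucination.
- Explore several KG construction strategies (TF-IDF, KNN-MDR, KNN-ST, TAGME) and compare their effectiveness and trade-offs.
- Combine the traversal procedure with a prompt design that answers MD-QA questions from the retrieved passages.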
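The two modules listed above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the helper names (`tfidf_vectors`, `build_graph`, `traverse`, `overlap_score`), the toy passages, and the greedy best-first strategy are all assumptions made for illustration. In particular, `overlap_score` is a lexical stand-in for the LLM-based agent, which in KGP would judge candidates using a fine-tuned language model.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(passages):
    """One sparse TF-IDF vector (term -> weight) per passage, with smoothed IDF."""
    docs = [Counter(tokens(p)) for p in passages]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in doc.items()}
            for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def build_graph(passages, k=1):
    """Undirected kNN graph: link each passage to its k most similar passages."""
    vecs = tfidf_vectors(passages)
    graph = {i: set() for i in range(len(passages))}
    for i, u in enumerate(vecs):
        others = [j for j in range(len(vecs)) if j != i]
        for j in sorted(others, key=lambda j: cosine(u, vecs[j]), reverse=True)[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def overlap_score(question, visited_texts, candidate):
    """Hypothetical stand-in for the LLM agent: share of candidate words already in context."""
    context = set(tokens(question + " " + " ".join(visited_texts)))
    cand = set(tokens(candidate))
    return len(context & cand) / (len(cand) or 1)


def traverse(question, passages, graph, score, seeds, budget=3):
    """Greedy best-first traversal restricted to graph neighbors of visited nodes."""
    visited, frontier = [], set(seeds)
    while frontier and len(visited) < budget:
        context = [passages[v] for v in visited]
        best = max(frontier, key=lambda i: score(question, context, passages[i]))
        visited.append(best)
        frontier = (frontier | graph[best]) - set(visited)
    return [passages[i] for i in visited]


passages = [
    "Alice Smith directed the film Blue Harbor.",
    "Blue Harbor was filmed in Lisbon, Portugal.",
    "Lisbon is the capital of Portugal.",
    "Bananas are rich in potassium.",
]
question = "Where was the film directed by Alice Smith filmed?"
graph = build_graph(passages, k=1)
evidence = traverse(question, passages, graph, overlap_score, seeds=[0], budget=3)
```

The graph limits which passages the agent may consider at each step, while the scoring function decides which neighbor to follow; swapping `overlap_score` for a model-based scorer would leave the rest of the sketch unchanged.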
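Experimental Results
Research Questions
- RQ1: Compared with baseline methods, how does a document-based knowledge graph improve prompting and retrieval for MD-QA?
- RQ2: Which KG construction strategies best capture the cross-document reasoning that MD-QA requires?
- RQ3: Can an LLM-guided KG traversal agent navigate the graph effectively to retrieve the context needed to answer a question?
- RQ4: How does incorporating document structure (pages/tables) affect MD-QA performance?
- RQ5: How does the trade-off between performance and efficiency change with KG density and traversal strategy?
Key Findings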
| Method | HotpotQA Acc | HotpotQA EM | HotpotQA F1 | IIRC Acc | IIRC EM | IIRC F1 | 2WikiMQA Acc | 2WikiMQA EM | 2WikiMQA F1 | MuSiQue Acc | MuSiQue EM | MuSiQue F1 | PDFTriage Struct-EM | Rank (w/ PDFTriage) | Rank (w/o PDFTriage) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 41.80 | 19.00 | 30.50 | 19.50 | 8.60 | 13.17 | 44.40 | 18.60 | 25.07 | 30.40 | 4.60 | 10.58 | 0.00 | 8.53 | 9.00 |
| KNN | 71.57 | 40.73 | 57.97 | 43.82 | 25.15 | 37.24 | 52.40 | 31.20 | 42.13 | 44.70 | 18.86 | 30.04 | – | 7.00 | 7.33 |
| TF-IDF | 76.64 | 45.97 | 64.64 | 47.47 | 27.22 | 40.80 | 58.40 | 34.60 | 44.50 | 44.40 | 21.59 | 32.50 | – | 4.85 | 5.00 |
| BM25 | 71.95 | 41.46 | 59.73 | 41.93 | 23.48 | 35.55 | 55.80 | 30.80 | 40.55 | 44.47 | 21.11 | 31.15 | – | 6.92 | 7.25 |
| DPR | 73.43 | 43.61 | 62.11 | 48.11 | 26.89 | 41.85 | 62.40 | 35.60 | 51.10 | 44.27 | 20.32 | 31.64 | – | 5.31 | 5.50 |
| MDR | 75.30 | 45.55 | 65.16 | 50.84 | 27.52 | 43.47 | 63.00 | 36.00 | 52.44 | 48.39 | 23.49 | 37.03 | – | 3.07 | 3.08 |
| IRCoT | 74.36 | 45.29 | 64.12 | 49.78 | 27.73 | 41.65 | 61.81 | 37.75 | 50.17 | 45.14 | 22.46 | 34.21 | – | 4.00 | 4.08 |
| KGP-T5 | 76.53 | 46.51 | 66.77 | 48.28 | 26.94 | 41.54 | 63.50 | 39.80 | 53.50 | 50.92 | 27.90 | 41.19 | 67.00 | 2.69 | 2.75 |
| Golden | 82.19 | 50.20 | 71.06 | 62.68 | 35.64 | 54.76 | 72.60 | 40.20 | 59.69 | 57.00 | 30.60 | 47.75 | 100.00 | 1.00 | 1.00 |
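- KGP-T5 achieves the strongest overall performance on the MD-QA benchmarks and generally outperforms the baselines, trailing only the Golden-context setting; individual baselines still lead on a few metrics (e.g., MDR on IIRC, TF-IDF on HotpotQA accuracy).
- MDR-based traversal and KGs built with domain-specific pre-trained encoders yield stronger results than general-purpose embedding methods (DPR).
- KGs that include structural nodes can handle structural questions (e.g., the difference between Page 1 and Page 2), giving a large gain on Struct-EM (67% reported in Table 1).
- GPT/LLM-based traversal agents significantly outperform random traversal and can surpass several baseline retrievers in accuracy and F1 across HotpotQA, 2WikiMQA, MuSiQue, and IIRC.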
- Trade-offs exist between KG density and retrieval latency: higher density improves EM/F1 but increases latency; a well-tuned branching factor is crucial for maximizing performance under a fixed context budget.
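
For readers unfamiliar with the EM and F1 columns, the sketch below shows the SQuAD-style definitions commonly used for extractive QA. This summary does not state how the paper normalizes answers, so the normalization rules here (lowercasing, stripping punctuation and English articles) are an assumption, and the function names are hypothetical. The accuracy and Struct-EM columns are not covered.

```python
import re
import string
from collections import Counter


def normalize(answer):
    """Assumed normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = f1_score("in Lisbon Portugal", "Lisbon")
```

EM rewards only exact answers after normalization, whereas F1 gives partial credit for overlapping tokens, which is why the F1 values in the table are consistently higher than the EM values.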
This summary was generated by AI and reviewed by a human editor.