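[Paper Review] Knowledge Graph Prompting for Multi-Document Question Answering
Brief summary: This paper introduces Knowledge Graph Prompting (KGP) for multi-document question answering (MD-QA). It builds a knowledge graph over passages and document structures, then uses an LLM-based traversal agent to retrieve contextual evidence for questions that span documents.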
The 'pre-train, prompt, predict' paradigm of large language models (LLMs) has achieved remarkable success in open-domain question answering (OD-QA). However, few works explore this paradigm in the scenario of multi-document question answering (MD-QA), a task demanding a thorough understanding of the logical associations among the contents and structures of different documents. To fill this crucial gap, we propose a Knowledge Graph Prompting (KGP) method to formulate the right context in prompting LLMs for MD-QA, which consists of a graph construction module and a graph traversal module. For graph construction, we create a knowledge graph (KG) over multiple documents with nodes symbolizing passages or document structures (e.g., pages/tables), and edges denoting the semantic/lexical similarity between passages or intra-document structural relations. For graph traversal, we design an LLM-based graph traversal agent that navigates across nodes and gathers supporting passages assisting LLMs in MD-QA. The constructed graph serves as the global ruler that regulates the transitional space among passages and reduces retrieval latency. Concurrently, the graph traversal agent acts as a local navigator that gathers pertinent context to progressively approach the question and guarantee retrieval quality. Extensive experiments underscore the efficacy of KGP for MD-QA, signifying the potential of leveraging graphs in enhancing the prompt design for LLMs. Our code: https://github.com/YuWVandy/KG-LLM-MDQA.
Motivation and Objectives
- Motivate MD-QA beyond open-domain QA by requiring cross-document reasoning and structured content understanding.
- Propose a generally applicable KG construction method that encodes lexical/semantic similarity and document structure relations.
- Develop an LLM-guided graph traversal agent to adaptively retrieve relevant contexts.
- Demonstrate that graph-based prompting improves MD-QA performance and retrieval efficiency across multiple datasets.
Proposed Method
- Construct knowledge graphs where nodes are passages or document structures (pages/tables) and edges encode lexical/semantic similarity or structural relations.
- Augment graphs with structural nodes (pages, tables) and use markdown content for tables to aid LLM understanding.
- Train or fine-tune an LLM-based graph traversal agent that, given visited passages, selects the next best neighbor to visit to approach the answer.
- Employ instruction-finetuning to enhance the reasoning capability of the traversal agent to mitigate hallucinations.
- Explore multiple KG construction strategies (TF-IDF, KNN-MDR, KNN-ST, TAGME) and compare their effectiveness and trade-offs.
- Integrate the traversal process with a prompt design that uses the retrieved passages to answer MD-QA questions.
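
The graph construction step described above can be illustrated with a minimal sketch. It assumes TF-IDF cosine similarity (one of the strategies listed) as the edge criterion and adds one structural node per page. The function names, the `threshold` value, the node-id scheme, and the sample documents are all hypothetical choices for illustration; this is not the authors' implementation, which is available in the linked repository.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(passages):
    """Return one sparse TF-IDF vector (a dict) per passage, with smoothed IDF."""
    token_lists = [tokenize(p) for p in passages]
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n = len(passages)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({
            tok: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, c in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def build_graph(docs, threshold=0.2):
    """Build a passage/page graph.

    docs maps a document id to a list of pages; each page is a list of passages.
    Returns (nodes, adj), where adj maps a node id to a set of neighbour ids.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    nodes, adj, passage_ids = {}, {}, []
    for doc_id, pages in docs.items():
        for page_no, page in enumerate(pages, start=1):
            page_id = f"{doc_id}/page{page_no}"
            nodes[page_id] = {"type": "page", "text": ""}
            adj[page_id] = set()
            for idx, passage in enumerate(page):
                pid = f"{page_id}/p{idx}"
                nodes[pid] = {"type": "passage", "text": passage}
                # Structural edge: a passage is linked to the page that contains it.
                adj[pid] = {page_id}
                adj[page_id].add(pid)
                passage_ids.append(pid)
    vecs = tfidf_vectors([nodes[p]["text"] for p in passage_ids])
    # Lexical edges: link every passage pair whose similarity clears the threshold.
    for i in range(len(passage_ids)):
        for j in range(i + 1, len(passage_ids)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                adj[passage_ids[i]].add(passage_ids[j])
                adj[passage_ids[j]].add(passage_ids[i])
    return nodes, adj


# Hypothetical two-document corpus used only to exercise the sketch.
docs = {
    "docA": [["Marie Curie was born in Warsaw.",
              "She won the Nobel Prize in Physics in 1903."]],
    "docB": [["Warsaw is the capital of Poland."],
             ["The Nobel Prize in Physics is awarded annually."]],
}
nodes, adj = build_graph(docs)
```

Raising `threshold` makes the graph sparser, while lowering it adds more cross-document edges. The other strategies named above (KNN-MDR, KNN-ST, TAGME) would replace `cosine` over TF-IDF vectors with a different notion of relatedness.
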
Experimental Results
Research Questions
- RQ1: How can a knowledge graph over documents improve MD-QA prompting and retrieval compared to baseline methods?
- RQ2: What KG construction strategies best capture the necessary cross-document reasoning for MD-QA?
- RQ3: Can an LLM-guided KG traversal agent effectively navigate the graph to retrieve relevant context for answering questions?
- RQ4: How does incorporating document structures (pages/tables) influence MD-QA performance?
- RQ5: What are the performance and efficiency trade-offs as KG density and traversal strategies vary?
Key Results
| Method | HotpotQA Acc | HotpotQA EM | HotpotQA F1 | IIRC Acc | IIRC EM | IIRC F1 | 2WikiMQA Acc | 2WikiMQA EM | 2WikiMQA F1 | MuSiQue Acc | MuSiQue EM | MuSiQue F1 | PDFTriage Struct-EM | Rank (w/ PDFTriage) | Rank (w/o PDFTriage) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 41.80 | 19.00 | 30.50 | 19.50 | 8.60 | 13.17 | 44.40 | 18.60 | 25.07 | 30.40 | 4.60 | 10.58 | 0.00 | 8.53 | 9.00 |
| KNN | 71.57 | 40.73 | 57.97 | 43.82 | 25.15 | 37.24 | 52.40 | 31.20 | 42.13 | 44.70 | 18.86 | 30.04 | – | 7.00 | 7.33 |
| TF-IDF | 76.64 | 45.97 | 64.64 | 47.47 | 27.22 | 40.80 | 58.40 | 34.60 | 44.50 | 44.40 | 21.59 | 32.50 | – | 4.85 | 5.00 |
| BM25 | 71.95 | 41.46 | 59.73 | 41.93 | 23.48 | 35.55 | 55.80 | 30.80 | 40.55 | 44.47 | 21.11 | 31.15 | – | 6.92 | 7.25 |
| DPR | 73.43 | 43.61 | 62.11 | 48.11 | 26.89 | 41.85 | 62.40 | 35.60 | 51.10 | 44.27 | 20.32 | 31.64 | – | 5.31 | 5.50 |
| MDR | 75.30 | 45.55 | 65.16 | 50.84 | 27.52 | 43.47 | 63.00 | 36.00 | 52.44 | 48.39 | 23.49 | 37.03 | – | 3.07 | 3.08 |
| IRCoT | 74.36 | 45.29 | 64.12 | 49.78 | 27.73 | 41.65 | 61.81 | 37.75 | 50.17 | 45.14 | 22.46 | 34.21 | – | 4.00 | 4.08 |
| KGP-T5 | 76.53 | 46.51 | 66.77 | 48.28 | 26.94 | 41.54 | 63.50 | 39.80 | 53.50 | 50.92 | 27.90 | 41.19 | 67.00 | 2.69 | 2.75 |
| Golden | 82.19 | 50.20 | 71.06 | 62.68 | 35.64 | 54.76 | 72.60 | 40.20 | 59.69 | 57.00 | 30.60 | 47.75 | 100.00 | 1.00 | 1.00 |
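
For readers unfamiliar with the EM and F1 columns, the sketch below shows how such scores are commonly computed for extractive QA. It assumes the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); this review does not state the paper's exact evaluation script, so treat the details as an assumption. The Acc column is not covered here.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a verbose but correct answer scores zero on EM while still earning partial F1 credit, which is one reason the F1 column is consistently higher than EM in the table.
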
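- KGP-T5 is the strongest retrieval method overall in Table 1, trailing only the Golden context. It leads on all three metrics for 2WikiMQA and MuSiQue, while MDR is stronger on IIRC and TF-IDF is marginally ahead on HotpotQA accuracy (76.64 vs. 76.53).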
- MDR-based traversals and KGs tuned with domain-specific pretraining yield stronger results than generic embedding-based methods (DPR).
- KGs incorporating structural nodes enable handling structural questions (e.g., differences between Page 1 and Page 2) with substantial Struct-EM gains (67% reported in Table 1).
- GPT/LLM-based traversal agents significantly outperform random traversal and can surpass several baseline retrievers in accuracy and F1 across HotpotQA, 2WikiMQA, MuSiQue, and IIRC.
- Trade-offs exist between KG density and retrieval latency: higher density improves EM/F1 but increases latency; a well-tuned branching factor is crucial for maximizing performance under a fixed context budget.
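
The interaction between branching factor and context budget can be illustrated with a small traversal sketch. To keep it self-contained, a simple token-overlap scorer stands in for the LLM-based traversal agent; that substitution, along with the names `traverse`, `branching`, `budget`, and the toy graph, is an assumption made for illustration rather than the paper's method.

```python
import re
from collections import deque


def tokens(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question, visited_text, candidate_text):
    """Stand-in for the LLM agent: count the question tokens that the visited
    passages have not covered yet and that the candidate passage supplies."""
    missing = tokens(question) - tokens(visited_text)
    return len(missing & tokens(candidate_text))


def traverse(question, passages, adj, seeds, branching=2, budget=4):
    """Breadth-first traversal that expands at most `branching` neighbours per
    visited node and stops once `budget` passages have been collected."""
    retrieved = []
    seen = set(seeds)
    queue = deque([s] for s in seeds)  # each queue item is a path of node ids
    while queue and len(retrieved) < budget:
        path = queue.popleft()
        node = path[-1]
        retrieved.append(node)
        context = " ".join(passages[n] for n in path)
        candidates = [n for n in adj[node] if n not in seen]
        candidates.sort(
            key=lambda n: overlap_score(question, context, passages[n]),
            reverse=True,
        )
        for n in candidates[:branching]:
            seen.add(n)
            queue.append(path + [n])
    return retrieved


# Hypothetical toy graph: node ids map to passages, adj lists the neighbours.
passages = {
    "a": "Marie Curie was born in Warsaw.",
    "b": "Warsaw is the capital of Poland.",
    "c": "Marie Curie won the Nobel Prize in Physics.",
    "d": "Poland joined the European Union in 2004.",
}
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
question = "What is the capital of the country where Marie Curie was born?"

narrow = traverse(question, passages, adj, seeds=["a"], branching=1, budget=4)
wide = traverse(question, passages, adj, seeds=["a"], branching=2, budget=3)
```

With `branching=1` the walk commits to the single best-scoring neighbour at each step and never reaches passage `c`. With `branching=2` it explores more neighbours per node but fills the budget sooner, so deeper nodes such as `d` are cut off.
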
This review was generated by AI and reviewed by a human editor.