[Paper Interpretation] ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations
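Introduces ViHERMES, a dataset for multihop question answering over Vietnamese healthcare regulations, together with a graph-aware QA system that outperforms retrieval-based baselines.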
Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura-hcmut/ViHERMES.
Research Motivation and Objectives
- Motivate the need for multihop regulatory QA in Vietnamese healthcare settings and address the lack of suitable benchmarks.
- Propose ViHERMES as a high-quality, evidence-grounded dataset with diverse dependency patterns across regulations.
- Develop a graph-aware retrieval framework built on a structure-driven regulatory knowledge graph (SRKG), plus a multi-agent QA system, to produce legally valid and coherent answers.
- Demonstrate empirical gains of the proposed system over strong retrieval-based baselines on ViHERMES.
Proposed Method
- Construct ViHERMES via a pipeline combining semantic clustering and graph-inspired data mining to select coherent regulatory contexts.
- Represent regulatory units as nodes in a structure-driven regulatory knowledge graph (SRKG) with structural and legal edges.
- Use seeded retrieval over regulatory units and relation-aware propagation to assemble bounded context sets.
- Employ a multi-agent system (Interpreter, Pathfinder, Auditor, Conductor) to route queries, retrieve evidence, verify grounding, and generate answers.
- Evaluate with token-level F1, LLM-as-a-Judge correctness metrics, and Recall@5 for evidence retrieval.
- Compare against Naive RAG, IRCoT, and graph-based baselines (MiniRAG, RAPTOR, LightRAG, HippoRAG2).
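
The seeded retrieval and relation-aware propagation steps listed above can be illustrated with a small sketch. This is not the authors' implementation: the toy graph, the relation names (`structural`, `refers_to`, `amended_by`), the per-relation weights, and the `max_hops`, `budget`, and `min_score` parameters are all assumptions chosen for illustration. The idea shown is only that seed scores from an initial retriever are pushed along typed edges, decayed by relation type, and cut off by a hop limit and a context budget.

```python
from collections import deque

# Hypothetical toy graph: nodes are legal units, edges carry a relation label.
EDGES = {
    "Circular-A/Art1": [("Circular-A/Art2", "structural"),
                        ("Decree-B/Art5", "refers_to")],
    "Circular-A/Art2": [],
    "Decree-B/Art5": [("Decree-C/Art3", "amended_by")],
    "Decree-C/Art3": [("Decree-C/Art4", "structural")],
    "Decree-C/Art4": [],
}

# Assumed decay factor per relation type (illustrative values only).
RELATION_WEIGHT = {"amended_by": 0.9, "refers_to": 0.8, "structural": 0.5}


def expand_context(seeds, edges, max_hops=2, budget=4, min_score=0.3):
    """Propagate seed scores along typed edges and return a bounded context set.

    seeds: dict mapping seed node id -> retrieval score.
    Returns at most `budget` node ids, highest propagated score first.
    """
    best = dict(seeds)  # best score seen so far for each node
    frontier = deque((node, score, 0) for node, score in seeds.items())
    while frontier:
        node, score, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, relation in edges.get(node, []):
            new_score = score * RELATION_WEIGHT.get(relation, 0.0)
            # Keep a neighbor only if it is strong enough and improves on
            # any score it already received through another path.
            if new_score >= min_score and new_score > best.get(neighbor, 0.0):
                best[neighbor] = new_score
                frontier.append((neighbor, new_score, hops + 1))
    # Sort by descending score, breaking ties by node id for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:budget]]


context = expand_context({"Circular-A/Art1": 1.0}, EDGES)
print(context)
```

In this sketch an amendment edge decays the score less than a structural edge, so an amending provision two hops away can still outrank a sibling article. Whether the actual system weights relations this way is not stated in the summary.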

Experimental Results
Research Questions
- RQ1: How can multihop reasoning across Vietnamese healthcare regulations be effectively modeled and evaluated?
- RQ2: Does a structure-driven SRKG with seeded retrieval and relation-aware propagation improve grounding and accuracy over baselines in regulatory QA?
- RQ3: What is the impact of each system component (Interpreter, Pathfinder, Auditor) on overall QA performance?
- RQ4: What are the trade-offs between accuracy, grounding reliability, and inference latency in graph-aware regulatory QA?
Key Findings
- On ViHERMES, the proposed system achieves the best performance among the evaluated methods on F1, LLM Judge, and Recall@5.
- The proposed system (Ours) attains F1 0.8334, LLM Judge 0.7554, and Recall@5 0.8461 on the ViHERMES test set.
- Removing the Auditor or the Interpreter degrades performance, highlighting the importance of grounding verification and intent routing.
- Seeded SRKG-based retrieval with relation-aware propagation outperforms flat dense–sparse retrieval baselines and other graph baselines.
- Inference latency (about 14.74 s) is competitive with RAPTOR and lower than that of HippoRAG2, with efficient graph-token utilization.
- Ablations show substantial performance drops when Pathfinder is replaced with non-structure-aware retrieval, validating the SRKG approach.
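
For readers unfamiliar with the two automatic metrics reported above, the following is a minimal sketch of token-level F1 and Recall@k. It assumes lowercasing and whitespace tokenization, which may differ from the normalization used in the paper (Vietnamese word segmentation in particular is not handled here). The function names `token_f1` and `recall_at_k` are illustrative, not taken from the ViHERMES repository.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved_ids, gold_ids, k=5):
    """Fraction of gold evidence units found among the top-k retrieved units."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    hits = gold & set(retrieved_ids[:k])
    return len(hits) / len(gold)


print(token_f1("a b c", "a b d"))
print(recall_at_k(["u1", "u2", "u3", "u4", "u5", "u6"], ["u2", "u6"]))
```

Per-question scores computed this way would typically be averaged over the test set to obtain corpus-level figures such as those listed above.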

This interpretation was generated by AI and reviewed by human editors.