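[Paper Explained] SF-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Question Answering
SF-RAG preserves the native hierarchical structure of academic papers when retrieving evidence for question answering, which reduces retrieval fragmentation and improves evidence allocation under a fixed token budget.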
Efficient question answering (QA) over extensive scientific literature is essential for evidence-based engineering decision-making. Retrieval-augmented generation (RAG) is increasingly applied to question answering over long academic papers, where accurate evidence allocation under a fixed token budget is critical. However, existing approaches flatten papers into unstructured chunks, destroying the native hierarchical structure and forcing retrieval to operate in a disordered space. This produces fragmented contexts, misallocates tokens to non-evidential regions, and increases the reasoning burden for downstream language models. To address these issues, we propose SF-RAG, a RAG framework that treats the native hierarchical structure of academic papers as a low-entropy retrieval prior. SF-RAG first inherits the native hierarchy to construct a structure-fidelity index, which prevents entropy increase at the source. It then designs a path-guided retrieval mechanism that aligns query semantics to relevant sections and selects high-relevance root-to-leaf paths under a fixed token budget, yielding compact, coherent, and low-entropy retrieval contexts. In contrast to existing RAG approaches, SF-RAG avoids entropy increase caused by destructive preprocessing and provides a native low-entropy structural basis for subsequent retrieval. We further introduce entropy-based structural diagnostics to quantify retrieval fragmentation and evidence allocation accuracy. Evaluations across three QA benchmarks show that SF-RAG significantly reduces retrieval fragmentation and improves evidence allocation. These structural benefits drive superior answer quality, establishing a scalable foundation for intelligent engineering document systems and future applications in technical specifications.
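Research Motivation and Objectives
- Enable efficient question answering over long scientific literature to support evidence-based engineering decisions.
- Identify the limitations of flattening papers into unstructured chunks and the effect this has on evidence allocation.
- Propose a retrieval-augmented generation framework that preserves structure in order to reduce entropy during retrieval.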
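Proposed Method
- Inherit the paper's native hierarchy to build a structure-fidelity index.
- Design a path-guided retrieval mechanism that aligns query semantics with relevant sections and selects high-relevance root-to-leaf paths within a fixed token budget.
- Achieve low-entropy retrieval by avoiding destructive preprocessing and keeping structural context intact.
- Introduce entropy-based structural diagnostics to quantify retrieval fragmentation and evidence allocation.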
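The indexing and path-selection steps listed above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the word-overlap scorer `overlap_score`, the whitespace token count, and the greedy selection in `select_paths` are all assumptions chosen for illustration. A real system would likely score sections with embeddings and count tokens with the model's tokenizer.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One section of a paper; children keep the paper's own nesting (hypothetical type)."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def words(s: str) -> set[str]:
    return set(s.lower().split())


def overlap_score(query: str, node: Node) -> float:
    """Stand-in relevance: fraction of query words found in the node's title and text."""
    q = words(query)
    return len(q & words(node.title + " " + node.text)) / len(q) if q else 0.0


def root_to_leaf_paths(node: Node, prefix: tuple = ()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from root_to_leaf_paths(child, path)


def select_paths(root: Node, query: str, budget: int):
    """Greedily keep the highest-scoring paths whose new nodes still fit the budget."""
    scored = []
    for path in root_to_leaf_paths(root):
        score = sum(overlap_score(query, n) for n in path) / len(path)
        scored.append((score, path))
    scored.sort(key=lambda item: item[0], reverse=True)

    chosen, seen, used = [], set(), 0
    for score, path in scored:
        # Shared ancestors are paid for once, so sibling paths are cheap to add.
        new_nodes = [n for n in path if id(n) not in seen]
        cost = sum(len((n.title + " " + n.text).split()) for n in new_nodes)
        if score > 0 and used + cost <= budget:
            chosen.append(path)
            seen.update(id(n) for n in new_nodes)
            used += cost
    return chosen, used


paper = Node("SF-RAG paper", "", [
    Node("Method", "overview of the framework", [
        Node("Path-guided retrieval", "select root-to-leaf paths under a token budget"),
        Node("Structure-fidelity index", "inherit the native section hierarchy"),
    ]),
    Node("Experiments", "three QA benchmarks", [
        Node("Results", "fragmentation and evidence allocation"),
    ]),
])

paths, used = select_paths(paper, "token budget retrieval", budget=20)
for p in paths:
    print(" > ".join(n.title for n in p), f"({used} tokens used)")
```

Because each selected unit is a whole root-to-leaf path, the retrieved context always carries its section headings with it, which is the intuition behind keeping the context compact and coherent instead of returning isolated chunks.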
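Experimental Results
Research Questions
- RQ1: In RAG-based question answering, does preserving the native hierarchical structure of academic papers reduce retrieval fragmentation?
- RQ2: Under a fixed token budget, does structure-fidelity retrieval improve evidence allocation and answer quality?
- RQ3: How do entropy-based structural diagnostics reflect retrieval performance in academic question answering?
- RQ4: How does path-guided retrieval affect the alignment of queries with relevant sections?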
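To make the idea of an entropy-based diagnostic concrete, here is a small Python sketch. The summary does not give the paper's exact formulas, so both measures below are assumptions: `section_entropy` uses the Shannon entropy of how retrieved tokens spread across sections as a proxy for fragmentation, and `evidence_share` uses the fraction of retrieved tokens that land in sections known to contain the evidence as a proxy for allocation accuracy.

```python
import math
from collections import Counter


def section_entropy(retrieved):
    """Shannon entropy (bits) of retrieved tokens across sections.

    retrieved: list of (section_id, token_count) pairs.
    0.0 means all tokens come from one section; higher means more scattered.
    """
    totals = Counter()
    for section_id, n_tokens in retrieved:
        totals[section_id] += n_tokens
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in totals.values() if n > 0)


def evidence_share(retrieved, evidence_sections):
    """Fraction of retrieved tokens that fall inside gold evidence sections."""
    total = sum(n for _, n in retrieved)
    hit = sum(n for s, n in retrieved if s in evidence_sections)
    return hit / total if total else 0.0


# Hypothetical retrieval results with the same 400-token budget.
flat_chunks = [("1", 100), ("2.1", 100), ("3.2", 100), ("5", 100)]
path_guided = [("3.2", 300), ("3", 100)]

print(section_entropy(flat_chunks), evidence_share(flat_chunks, {"3.2"}))
print(section_entropy(path_guided), evidence_share(path_guided, {"3.2"}))
```

Under these assumed definitions, a retrieval that concentrates its budget on the evidence-bearing section scores lower entropy and a higher evidence share than one that spreads the same budget evenly over unrelated sections.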
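Key Findings
- Compared with the baselines, SF-RAG significantly reduces retrieval fragmentation.
- SF-RAG improves evidence allocation by keeping the retrieved context structurally consistent.
- Under a fixed token constraint, the structure-aware approach achieves higher answer quality on academic QA benchmarks.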
This explainer was generated by AI and reviewed by human editors.