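[Paper Review] Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering
This paper empirically compares Vanilla LLM, Basic RAG, and Advanced RAG (with cross-encoder re-ranking) pipelines for policy question answering over CDC documents. The results show that cross-encoder re-ranking substantially improves faithfulness and relevance, with Advanced RAG achieving the highest scores.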
The integration of Large Language Models (LLMs) into the public health policy sector offers a transformative approach to navigating the vast repositories of regulatory guidance maintained by agencies such as the Centers for Disease Control and Prevention (CDC). However, the propensity for LLMs to generate hallucinations, defined as plausible but factually incorrect assertions, presents a critical barrier to the adoption of these technologies in high-stakes environments where information integrity is non-negotiable. This empirical evaluation explores the effectiveness of Retrieval-Augmented Generation (RAG) architectures in mitigating these risks by grounding generative outputs in authoritative document context. Specifically, this study compares a baseline Vanilla LLM against Basic RAG and Advanced RAG pipelines utilizing cross-encoder re-ranking. The experimental framework employs a Mistral-7B-Instruct-v0.2 model and an all-MiniLM-L6-v2 embedding model to process a corpus of official CDC policy analytical frameworks and guidance documents. The analysis measures the impact of two distinct chunking strategies, recursive character-based and token-based semantic splitting, on system accuracy, measured through faithfulness and relevance scores across a curated set of complex policy scenarios. Quantitative findings indicate that while Basic RAG architectures provide a substantial improvement in faithfulness (0.621) over Vanilla baselines (0.347), the Advanced RAG configuration achieves a superior faithfulness average of 0.797. These results demonstrate that two-stage retrieval mechanisms are essential for achieving the precision required for domain-specific policy question answering, though structural constraints in document segmentation remain a significant bottleneck for multi-step reasoning tasks.
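Research Motivation and Objectives
- Ground LLM outputs in authoritative policy guidance to reduce hallucinations in public health settings.
- Evaluate retrieval-augmented generation pipelines on a corpus of CDC policy documents.
- Quantify the effect of chunking strategy and two-stage retrieval on answer faithfulness and relevance.
Proposed Method
- Implement a two-stage retrieval pipeline: a bi-encoder performs the initial retrieval, then a cross-encoder re-ranks the candidates.
- Embed the CDC policy document corpus with all-MiniLM-L6-v2 and re-rank the retrieved candidates with the cross-encoder ms-marco-MiniLM-L-6-v2.
- Formalize retrieval as over-retrieval followed by filtering: the top-k passages are selected from the candidate set and passed to the LLM prompt.
- Compare three system configurations: Vanilla LLM, Basic RAG, and Advanced RAG.
- Measure performance with faithfulness and relevance scores on an evaluation set of 10 questions.
- Provide qualitative examples and analyze failure modes and recoveries under Advanced RAG.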
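The over-retrieve-then-filter pattern described in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `bi_encoder_score` and `cross_encoder_score` are hypothetical toy stand-ins (bag-of-words cosine and query-bigram overlap) for all-MiniLM-L6-v2 and ms-marco-MiniLM-L-6-v2, and the example chunks, `n_candidates`, and `top_k` are assumptions chosen only to make the flow visible.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bi_encoder_score(query, chunk):
    """Stage-1 stand-in: query and chunk are vectorized independently
    (bag of words here) and compared with cosine similarity."""
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in c.values())
    )
    return dot / norm if norm else 0.0


def cross_encoder_score(query, chunk):
    """Stage-2 stand-in: the (query, chunk) pair is scored jointly.
    Here: fraction of query bigrams that also occur in the chunk."""
    q_tokens, c_tokens = tokenize(query), tokenize(chunk)
    q_bigrams = set(zip(q_tokens, q_tokens[1:]))
    c_bigrams = set(zip(c_tokens, c_tokens[1:]))
    return len(q_bigrams & c_bigrams) / len(q_bigrams) if q_bigrams else 0.0


def two_stage_retrieve(query, chunks, n_candidates=3, top_k=1):
    # Stage 1: over-retrieve a candidate set with the cheap scorer.
    candidates = sorted(
        chunks, key=lambda c: bi_encoder_score(query, c), reverse=True
    )[:n_candidates]
    # Stage 2: re-rank only those candidates with the costlier pairwise scorer.
    reranked = sorted(
        candidates, key=lambda c: cross_encoder_score(query, c), reverse=True
    )
    # Filter: keep the top-k chunks that would go into the LLM prompt.
    return reranked[:top_k]


# Illustrative chunks (not taken from the CDC corpus).
chunks = [
    "Policy analysis weighs public health impact, feasibility, and economic impact.",
    "Health impact and health policy and health policy impact are discussed broadly.",
    "The cafeteria menu changes every week.",
    "Stakeholder engagement informs policy development.",
]
result = two_stage_retrieve("public health impact of a policy", chunks)
print(result)
```

The design point is that the pairwise scorer runs only on `n_candidates` chunks rather than on the whole corpus. In a real pipeline the two toy scorers would be replaced by the embedding model and cross-encoder named above.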
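Experimental Results
Research Questions
- RQ1: How much does advanced retrieval (cross-encoder re-ranking) improve the faithfulness and relevance of policy question-answering outputs?
- RQ2: How does the chunking strategy affect the grounding of policy answers?
- RQ3: Is a two-stage retrieval pipeline necessary for high-precision, policy-grounded answers?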
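To make RQ2 concrete, the two families of chunking the paper contrasts can be sketched as follows. This is a simplified illustration under stated assumptions: the separator order, the size limits, and the use of whitespace tokens in place of a model tokenizer are all hypothetical, and the fixed token window is a plain simplification rather than the paper's token-based semantic splitter.

```python
def recursive_char_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, recursing to finer
    separators only for pieces that still exceed max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # Piece is too long on its own: retry with finer separators.
                chunks.extend(
                    recursive_char_split(piece, max_chars, separators[i + 1:])
                )
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def token_window_split(text, max_tokens, overlap=0):
    """Fixed-size token windows; whitespace tokens stand in for a tokenizer."""
    step = max_tokens - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_tokens")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Illustrative text (not taken from the CDC corpus).
text = (
    "Define the problem.\n\n"
    "Identify policy options. Assess public health impact.\n\n"
    "Prioritize."
)
char_chunks = recursive_char_split(text, max_chars=30)
token_chunks = token_window_split("a b c d e f g", max_tokens=3, overlap=1)
print(char_chunks)
print(token_chunks)
```

The character-based splitter tends to respect paragraph and sentence boundaries, while the token window guarantees a bounded context size but can cut a multi-step argument in the middle. That is the kind of segmentation constraint the abstract identifies as a bottleneck for multi-step reasoning.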
Key Findings
| QID | Vanilla Faithfulness | Basic RAG Faithfulness | Advanced RAG Faithfulness | Vanilla Relevance | Basic RAG Relevance | Advanced RAG Relevance |
|---|---|---|---|---|---|---|
| Q1 | 0.33 | 0.33 | 0.67 | 0.50 | 1.00 | 1.00 |
| Q2 | 0.33 | 0.67 | 0.83 | 0.33 | 1.00 | 1.00 |
| Q3 | 0.33 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 |
| Q4 | 0.33 | 0.33 | 0.16 | 0.50 | 0.50 | 0.50 |
| Q5 | 0.25 | 0.50 | 0.25 | 0.33 | 0.67 | 0.33 |
| Q6 | 0.33 | 0.67 | 1.00 | 0.33 | 0.80 | 1.00 |
| Q7 | 0.00 | 0.71 | 0.29 | 0.00 | 1.00 | 0.50 |
| Q8 | 0.40 | 0.00 | 0.80 | 0.50 | 0.00 | 0.67 |
| Q9 | 0.50 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 |
| Q10 | 0.67 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 |
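- The Vanilla LLM performs poorly on policy-oriented tasks because of hallucination.
- Basic RAG substantially improves faithfulness (0.621) over Vanilla (0.347), and also improves relevance (0.70 versus 0.45 in some cases).
- Advanced RAG achieves the highest average faithfulness (0.797) and, through cross-encoder re-ranking, the best overall grounding.
- Two-stage retrieval (bi-encoder first, then cross-encoder) substantially improves precision by restricting cross-encoder evaluation to a small candidate set (top-k).
- In the qualitative cases, Advanced RAG retrieves policy context consistent with the CDC frameworks, reducing policy-context drift and hallucination.
- Basic RAG can be unstable and fails on some queries when the retrieved context is irrelevant, whereas Advanced RAG recovers through precise verbatim alignment.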
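The averages quoted in the findings can be cross-checked against the per-question table with simple arithmetic. The sketch below transcribes the ten rows of the table above; the dictionary layout and variable names are illustrative choices, not part of the paper.

```python
from statistics import mean

# Per-question scores for Q1..Q10, transcribed from the table above.
scores = {
    "faithfulness": {
        "Vanilla": [0.33, 0.33, 0.33, 0.33, 0.25, 0.33, 0.00, 0.40, 0.50, 0.67],
        "Basic RAG": [0.33, 0.67, 1.00, 0.33, 0.50, 0.67, 0.71, 0.00, 1.00, 1.00],
        "Advanced RAG": [0.67, 0.83, 1.00, 0.16, 0.25, 1.00, 0.29, 0.80, 1.00, 1.00],
    },
    "relevance": {
        "Vanilla": [0.50, 0.33, 0.67, 0.50, 0.33, 0.33, 0.00, 0.50, 0.67, 0.67],
        "Basic RAG": [1.00, 1.00, 1.00, 0.50, 0.67, 0.80, 1.00, 0.00, 1.00, 1.00],
        "Advanced RAG": [1.00, 1.00, 1.00, 0.50, 0.33, 1.00, 0.50, 0.67, 1.00, 1.00],
    },
}

# Mean score per metric and system, rounded to three decimals.
averages = {
    metric: {system: round(mean(vals), 3) for system, vals in by_system.items()}
    for metric, by_system in scores.items()
}

for metric, by_system in averages.items():
    for system, avg in by_system.items():
        print(f"{metric:12s} {system:12s} {avg:.3f}")
```

With the rows as transcribed, the faithfulness means come out to roughly 0.347 (Vanilla), 0.621 (Basic RAG), and 0.700 (Advanced RAG), and the relevance means to roughly 0.450, 0.797, and 0.800. The first two faithfulness values match the reported figures. The Advanced RAG faithfulness mean of the tabulated rows (about 0.70) does not match the reported 0.797, so the table and the headline average may not derive from exactly the same scores.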
This review was generated by AI and checked by human editors.