QUICK REVIEW

[Paper Review] Chain-of-Retrieval Augmented Generation

Liang Wang, Haonan Chen | ArXiv.org | Jan 24, 2025

Speech and dialogue systems | Cited by 3

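One-Sentence Summary

CoRAG improves retrieval-augmented generation by interleaving retrieval and reasoning in a chain. This raises performance on multi-hop question answering and knowledge-intensive tasks, and lets test-time compute be controlled through the choice of decoding strategy.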

ABSTRACT

This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
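
The step-by-step loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`corag_greedy`, `next_query`, `answer_sub_query`, `final_answer`), the empty-string stop signal, and the two-sentence toy corpus are all hypothetical, and simple rule-based stubs stand in for the fine-tuned LLM and the retriever.

```python
from typing import Callable, List, Tuple

# Hypothetical type alias: a chain is a list of (sub-query, sub-answer) pairs.
Chain = List[Tuple[str, str]]


def corag_greedy(
    question: str,
    next_query: Callable[[str, Chain], str],
    answer_sub_query: Callable[[str, List[str]], str],
    final_answer: Callable[[str, Chain], str],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 4,
) -> Tuple[str, Chain]:
    """Alternate query reformulation and retrieval, then answer from the chain."""
    chain: Chain = []
    for _ in range(max_steps):
        sub_query = next_query(question, chain)
        if not sub_query:  # assumed stop signal: no further sub-query is needed
            break
        docs = retrieve(sub_query)
        chain.append((sub_query, answer_sub_query(sub_query, docs)))
    return final_answer(question, chain), chain


# --- Toy stand-ins for the model and the retriever (illustrative only) ---
CORPUS = [
    "Dune was written by Frank Herbert.",
    "Frank Herbert was born in Tacoma.",
]


def toy_retrieve(query: str, k: int = 1) -> List[str]:
    # Rank documents by word overlap with the query and keep the top k.
    q = set(query.lower().replace("?", "").split())
    ranked = sorted(
        CORPUS, key=lambda d: -len(q & set(d.lower().rstrip(".").split()))
    )
    return ranked[:k]


def toy_next_query(question: str, chain: Chain) -> str:
    # The next sub-query depends on what the chain has established so far.
    if len(chain) == 0:
        return "Who wrote Dune?"
    if len(chain) == 1:
        return f"Where was {chain[0][1]} born?"
    return ""


def toy_answer_sub_query(sub_query: str, docs: List[str]) -> str:
    # Extract the text following a known marker from the retrieved documents.
    for doc in docs:
        for marker in ("written by ", "born in "):
            if marker in doc:
                return doc.split(marker)[1].rstrip(".")
    return "unknown"


def toy_final_answer(question: str, chain: Chain) -> str:
    return chain[-1][1] if chain else "unknown"


answer, chain = corag_greedy(
    "Where was the author of Dune born?",
    toy_next_query,
    toy_answer_sub_query,
    toy_final_answer,
    toy_retrieve,
)
print(answer)  # Tacoma
```

The second sub-query can only be written once the first sub-answer is known, which is the case a single retrieval step handles poorly.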

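Motivation and Goals

Improve RAG by enabling iterative retrieval and reasoning rather than a single retrieval step.
Use rejection sampling to augment QA datasets with intermediate retrieval chains.
Train a large language model to predict the next action in a chain of retrieval and generation.
Study test-time decoding strategies for scaling the compute spent on chain-based retrieval.
Evaluate CoRAG on multi-hop QA datasets and the KILT benchmark to assess generalization and scalability.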

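Proposed Method

Generate retrieval chains via rejection sampling, augmenting QA datasets with sequences of sub-queries and sub-answers.
Fine-tune an open-source large language model on the augmented data with a multi-task objective covering sub-query, sub-answer, and final-answer prediction.
Use a retriever to fetch the top-k documents for each sub-query, and assess chain quality by the chain's log-likelihood.
Provide test-time decoding strategies, including greedy decoding, best-of-N sampling, and tree search, to control token consumption.
Analyze scaling behavior and robustness across datasets, retrievers, and generalization settings.
Optionally, learn a policy for stopping the chain at test time by predicting whether the information gathered so far is sufficient.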
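
The sample-then-select pattern behind both rejection sampling and best-of-N decoding can be sketched as below. This is a rough illustration under assumptions rather than the paper's procedure: `best_chain`, `toy_sample_chain`, and `toy_log_likelihood` are hypothetical names, the candidate pool is hand-written, and the scoring stub with fixed probabilities 0.9 and 0.1 merely stands in for a model's log-likelihood.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical type alias: a chain is a list of (sub-query, sub-answer) pairs.
Chain = List[Tuple[str, str]]


def best_chain(
    sample_chain: Callable[[random.Random], Chain],
    score: Callable[[Chain], float],
    n: int,
    seed: int = 0,
) -> Tuple[Chain, float]:
    """Sample n candidate chains and keep the one with the highest score."""
    rng = random.Random(seed)
    candidates = [sample_chain(rng) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    top_score, top_chain = max(scored, key=lambda pair: pair[0])
    return top_chain, top_score


# --- Toy stand-ins (illustrative only) ---
POOL = [
    ("Who wrote Dune?", "Frank Herbert"),
    ("Where was Frank Herbert born?", "Tacoma"),
    ("When was Dune published?", "1965"),
]


def toy_sample_chain(rng: random.Random) -> Chain:
    # Stand-in for sampling a chain from the model at nonzero temperature.
    return rng.sample(POOL, rng.randint(1, 3))


def toy_log_likelihood(chain: Chain, gold: str = "Tacoma") -> float:
    # Stand-in for log p(gold answer | question, chain): a chain that surfaces
    # the gold answer is assumed to make that answer far more likely.
    supported = any(sub_answer == gold for _, sub_answer in chain)
    return math.log(0.9 if supported else 0.1)


chain, score = best_chain(toy_sample_chain, toy_log_likelihood, n=8)
print(len(chain), round(score, 3))
```

When building training data, the gold answer is available, so a scorer like the one above can use it. At test time it is not, so a best-of-N scorer would have to rely on a signal that does not depend on the gold answer.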

(a) Test-time scaling behavior of CoRAG.

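Experimental Results

Research Questions

RQ1: Can iterative retrieval and reasoning improve on single-step RAG for complex multi-hop question answering?
RQ2: How does test-time compute (chain length and number of chains) affect performance and efficiency?
RQ3: Do retrieval chains generalize to diverse knowledge-intensive tasks beyond question answering?
RQ4: How do weaker or stronger retrievers and models affect CoRAG's effectiveness?
RQ5: Is it beneficial to learn an early-stopping mechanism for retrieval chains at inference time?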

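Key Findings

CoRAG substantially outperforms strong baselines on multi-hop QA datasets, with clear EM/F1 gains under multiple decoding strategies.
On the KILT benchmark, CoRAG reaches state-of-the-art performance across diverse tasks, the main exception being FEVER.
Lengthening the retrieval chain improves performance when starting from short chains, but the gains diminish as chains grow longer.
On several datasets, test-time scaling shows a log-linear relationship between token consumption and performance.
Robustness experiments show that stronger retrievers help and that CoRAG still yields some gains with weaker retrievers, while generalization across task types remains favorable.
Ablations show mixed results for iterative training, suggesting that instruction-tuned LLMs can often already generate high-quality retrieval chains.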
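
A log-linear relationship means that performance grows roughly as a + b * ln(tokens), so each doubling of the token budget adds about b * ln 2 points. The sketch below fits such a curve by ordinary least squares. The data points are synthetic, generated from the formula itself with arbitrary coefficients, and are not results from the paper; `fit_log_linear` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fit_log_linear(tokens: List[float], scores: List[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * ln(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic illustration only: scores generated from 30 + 5 * ln(tokens).
tokens = [1_000.0, 2_000.0, 4_000.0, 8_000.0]
scores = [30.0 + 5.0 * math.log(t) for t in tokens]

a, b = fit_log_linear(tokens, scores)
gain_per_doubling = b * math.log(2)
print(round(a, 3), round(b, 3), round(gain_per_doubling, 3))
```

Under such a fit, the same absolute gain requires twice as many tokens each time, which matches the diminishing returns noted for longer chains.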

This review was generated by AI and reviewed by human editors.