QUICK REVIEW

[Paper Review] Learning to Filter Context for Retrieval-Augmented Generation

Zhiruo Wang, Jun Araki|arXiv (Cornell University)|Nov 14, 2023

Topic Modeling | Cited by 8

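One-Sentence Summary

FilCo proposes sentence-level context filtering: it trains a context filter supervised by the StrInc, lexical overlap, and CXMI measures, reducing noise and computation while improving performance on six knowledge-intensive tasks.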

ABSTRACT

On-the-fly retrieval of relevant knowledge has proven an essential element of reliable systems for tasks such as open-domain question answering and fact verification. However, because retrieval systems are not perfect, generation models are required to generate outputs given partially or entirely irrelevant passages. This can cause over- or under-reliance on context, and result in problems in the generated output such as hallucinations. To alleviate these problems, we propose FILCO, a method that improves the quality of the context provided to the generator by (1) identifying useful context based on lexical and information-theoretic approaches, and (2) training context filtering models that can filter retrieved contexts at test time. We experiment on six knowledge-intensive tasks with FLAN-T5 and LLaMa2, and demonstrate that our method outperforms existing approaches on extractive question answering (QA), complex multi-hop and long-form QA, fact verification, and dialog generation tasks. FILCO effectively improves the quality of context, whether or not it supports the canonical output.

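Motivation and Objectives

Explain why improving context quality matters in retrieval-augmented generation, so that the generator relies less on irrelevant passages.
Propose FilCo, a sentence-level context filtering method that selects useful text spans, guided by three measures.
Show that learned context filtering improves performance on six knowledge-intensive datasets.
Demonstrate a substantial reduction in input length, with gains comparable to or better than silver-standard filtering.
Offer guidance on which filtering signal works best for each task, and extend the evaluation to multi-passage settings.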

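Proposed Method

Define a fine-grained filtering function that selects sentence-level text spans from the retrieved passages.
Use three quasi-oracle filtering signals: StrInc (whether a span contains the output), lexical overlap (unigram overlap between a span and the example or its output), and CXMI (the change in generation probability when the context is added).
Train a context filtering model M_ctx to predict the filtered context t_silver from the query q and the retrieved passages P, using the oracle filtering signals as supervision.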
Train a generation model M_gen to produce the target output o given the filtered context t_silver (training stage).
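At test time, use M_ctx to generate t_pred, then concatenate q with t_pred and feed the result to M_gen to produce o.
Compare FilCo (sentence-level filtering) against full-context augmentation and passage-level filtering (Psg), and use t_silver directly as the silver-standard upper bound.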
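
The three filtering signals can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the regex tokenizer, and the choice of F1 as the unigram-overlap score are assumptions made for illustration, and the CXMI-style function takes precomputed log-probabilities as inputs rather than calling a real generator.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def strinc(span, output):
    """StrInc: True if the span contains the output string (case-insensitive here)."""
    return output.lower() in span.lower()


def unigram_f1(span, reference):
    """Lexical overlap, scored here as unigram F1 between span and reference."""
    span_tokens, ref_tokens = tokenize(span), tokenize(reference)
    if not span_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(span_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(span_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def cxmi_score(logprob_with_context, logprob_without_context):
    """CXMI-style score: gain in the output's log-probability when context is added.

    Both arguments are assumed to be log-probabilities of the same output,
    obtained from some generator outside this sketch.
    """
    return logprob_with_context - logprob_without_context


def select_silver_span(sentences, score_fn):
    """Pick the highest-scoring sentence as a stand-in for t_silver."""
    return max(sentences, key=score_fn)


sentences = [
    "Paris is the capital of France.",
    "The city hosts the Louvre museum.",
    "France borders Spain and Italy.",
]
best = select_silver_span(sentences, lambda s: strinc(s, "Paris"))
print(best)
```

Passing a different `score_fn` (for example, `lambda s: unigram_f1(s, output)`) switches the signal without changing the selection step, which mirrors the idea that different tasks may prefer different signals.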

Figure 1: FilCo filters out irrelevant content (marked in red) and leaves precisely supporting content, making it easier for the generator to predict the correct answer.

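Experimental Results

Research Questions

RQ1: Does sentence-level context filtering improve the fidelity and accuracy of retrieval-augmented generation outputs across multiple knowledge-intensive tasks?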
RQ2: Which filtering signals (StrInc, lexical overlap, CXMI) best suit different task types (extractive QA, fact verification, dialog)?
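RQ3: Can learned filtering reduce input length and computational cost without sacrificing, and often while improving, end-task performance?
RQ4: In multi-passage settings, how does FilCo perform relative to single-passage filtering and the full-context baseline?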

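Key Findings

FilCo consistently outperforms full-context augmentation and passage-level filtering across six tasks, spanning extractive QA, multi-hop QA, long-form QA, fact verification, and dialog generation.
FilCo reduces input length by 44-64% across tasks while achieving end-task results comparable to or better than those obtained with silver-filtered context.
Different tasks benefit from different filtering signals: StrInc for extractive QA, lexical overlap for dialog, and CXMI for more complex tasks such as multi-hop QA and fact verification.
FilCo yields notable gains on final generation metrics (e.g., with Flan-T5 and Llama2 respectively, EM on NQ improves by 4.3 and 8.6, F1 on ELI5 by 0.6 and 2.6, and accuracy on FEVER by 0.6 and 3.5).
In multi-passage settings, FilCo keeps its advantage over the baselines, and using the top five passages brings additional gains on several tasks.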

Figure 2: The FilCo pipeline: (i) filtering retrieved passages, (ii) generation with filtered context.
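
The two stages named in Figure 2 can be sketched as a small Python function. This is a hedged illustration only: `filco_predict`, `toy_filter`, and `toy_generator` are hypothetical names, the newline separator is an assumption, and the toy callables merely stand in for the trained models M_ctx and M_gen.

```python
from typing import Callable, List


def filco_predict(
    query: str,
    passages: List[str],
    filter_model: Callable[[str, str], str],
    generator: Callable[[str], str],
    sep: str = "\n",
) -> str:
    """Two-stage inference: filter the retrieved context, then generate."""
    retrieved = sep.join(passages)
    t_pred = filter_model(query, retrieved)  # (i) stand-in for M_ctx
    prompt = f"{query}{sep}{t_pred}"         # concatenate q with t_pred
    return generator(prompt)                 # (ii) stand-in for M_gen


def toy_filter(query: str, context: str) -> str:
    """Toy filter: keep sentences sharing a longer query word (not a trained model)."""
    keywords = {w.strip("?.,").lower() for w in query.split() if len(w) > 3}
    sentences = [s.strip() for s in context.replace("\n", " ").split(".") if s.strip()]
    kept = [s for s in sentences if keywords & {w.lower() for w in s.split()}]
    return ". ".join(kept) + "." if kept else ""


def toy_generator(prompt: str) -> str:
    """Toy generator: echo the filtered context, i.e. the last line of the prompt."""
    return prompt.split("\n")[-1]


answer = filco_predict(
    "What is the capital of France?",
    ["Paris is the capital of France. It hosts the Louvre.",
     "Bananas are rich in potassium."],
    toy_filter,
    toy_generator,
)
print(answer)
```

Passing the filter and the generator as callables keeps the sketch independent of any particular model library; in practice each would wrap a fine-tuned sequence-to-sequence or decoder model.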

This review was generated by AI and checked by a human editor.