QUICK REVIEW

[Paper Review] Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin|arXiv (Cornell University)|Sep 4, 2023

Topic Modeling | Cited by 52

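One-Sentence Summary

This paper presents RGB, a Retrieval-Augmented Generation (RAG) benchmark in English and Chinese that evaluates four RAG abilities of LLMs across six models, revealing significant limitations in noise handling, negative rejection, information integration, and counterfactual robustness.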

ABSTRACT

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

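Motivation and Objectives

Assess how retrieval augmentation affects LLMs on the core RAG abilities: noise robustness, negative rejection, information integration, and counterfactual robustness.
Create a bilingual (English/Chinese) benchmark, RGB, built from the latest news and retrieved documents to enable fair evaluation.
Diagnose bottlenecks and provide guidance for improving current LLMs that use RAG.
Provide analysis and future directions for developing RAG-enhanced LLMs.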

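Proposed Method

Build RGB by generating question-answer instances from the latest news articles, using prompts to produce events, questions, and answers.
Retrieve external documents through a search API, split them into text chunks, and apply dense retrieval to select the top-ranked chunks.
Expand the corpus and divide it into four testbeds, one for each of the four RAG abilities.
Evaluate six LLMs (ChatGPT, ChatGLM-6B, ChatGLM2-6B, Vicuna-7B, Qwen-7B-Chat, BELLE-7B-2M) on the English and Chinese data.
Use exact-match accuracy for noise robustness and information integration; a rejection signal for negative rejection; and, for counterfactual robustness, accuracy with and without documents plus error detection and correction rates.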
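
The noise-robustness setup and the exact-match metric can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (`build_noisy_context`, `exact_match`, `accuracy`), the substring-based matching rule, and the way documents are sampled are all assumptions made for the example.

```python
import random


def build_noisy_context(positive_docs, negative_docs, noise_ratio, num_docs, seed=0):
    """Mix relevant (positive) and irrelevant (negative) documents.

    Assumption: noise_ratio is the fraction of the num_docs context
    documents that do not contain the answer.
    """
    rng = random.Random(seed)
    num_negative = round(num_docs * noise_ratio)
    num_positive = num_docs - num_negative
    docs = rng.sample(positive_docs, num_positive) + rng.sample(negative_docs, num_negative)
    rng.shuffle(docs)  # avoid a fixed positive-then-negative ordering
    return docs


def exact_match(response, answers):
    """Assumed rule: correct if the response contains any accepted answer string."""
    text = response.lower()
    return any(answer.lower() in text for answer in answers)


def accuracy(responses, gold_answers):
    """Fraction of responses that contain an accepted answer."""
    hits = sum(exact_match(r, a) for r, a in zip(responses, gold_answers))
    return hits / len(responses)


if __name__ == "__main__":
    positives = [f"relevant doc {i}" for i in range(5)]
    negatives = [f"noise doc {i}" for i in range(5)]
    context = build_noisy_context(positives, negatives, noise_ratio=0.4, num_docs=5)
    print(context)
    print(accuracy(["The winner was Alice.", "I do not know."], [["Alice"], ["Bob"]]))
```

Sweeping `noise_ratio` from 0 upward while holding `num_docs` fixed gives one accuracy value per noise level, which is the kind of curve the noise-robustness testbed reports.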

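Experimental Results

Research Questions

RQ1: How robust are current LLMs to noise when using retrieved documents?
RQ2: Can LLMs correctly decline to answer when the retrieved information is insufficient?
RQ3: How well do LLMs integrate information across multiple retrieved documents?
RQ4: How do LLMs handle counterfactual errors in retrieved documents, and can they detect and correct them?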

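Key Findings

RAG improves answer accuracy for some models, but performance degrades as noise increases (for example, accuracy drops significantly when the noise ratio exceeds 0.8).
Negative rejection remains challenging; the highest rejection rates observed in the evaluation were 45% (English) and 43.33% (Chinese), indicating that models often produce answers based on noisy content.
Information integration is weak; even without noise, the highest accuracy is only 60% (English) and 67% (Chinese), and it declines further as noise increases.
Models show poor counterfactual robustness; accuracy without documents is higher than accuracy with counterfactual documents, and error detection and correction rates are limited.
Susceptibility to noise and document mismatch varies across models and languages; merging, ignoring, and misalignment errors affect questions with multiple sub-questions.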
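
Rates such as the rejection rate and the error detection and correction rates could be computed from model responses along these lines. This is a hedged sketch: the marker phrases `REJECT_PHRASE` and `ERROR_PHRASE` are hypothetical placeholders for whatever signal the prompt asks the model to emit, and computing the correction rate only over responses that flagged an error is an assumption about the denominator, not something this summary specifies.

```python
# Hypothetical marker phrases; the real benchmark prompt may use different wording.
REJECT_PHRASE = "insufficient information"
ERROR_PHRASE = "factual errors"


def rate(flags):
    """Share of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def rejection_rate(responses):
    """Fraction of responses that contain the rejection marker."""
    return rate(REJECT_PHRASE in r.lower() for r in responses)


def counterfactual_metrics(responses, gold_answers):
    """Error detection rate, plus correction rate among detected cases (assumed)."""
    detected = [ERROR_PHRASE in r.lower() for r in responses]
    corrected = [
        any(answer.lower() in r.lower() for answer in answers)
        for r, answers, flagged in zip(responses, gold_answers, detected)
        if flagged  # only responses that flagged an error are counted
    ]
    return {
        "error_detection_rate": rate(detected),
        "error_correction_rate": rate(corrected),
    }


if __name__ == "__main__":
    print(rejection_rate([
        "I cannot answer because of insufficient information in the documents.",
        "The answer is 42.",
    ]))
    print(counterfactual_metrics(
        [
            "There are factual errors in the documents. The answer is Paris.",
            "The answer is Lyon.",
            "The documents contain factual errors.",
        ],
        [["Paris"], ["Paris"], ["Paris"]],
    ))
```

String matching of this kind is brittle: a model that declines in different words is counted as not rejecting, so reported rates depend on how strictly the marker is defined.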

This review was generated by AI and reviewed by human editors.