QUICK REVIEW

[Paper Review] Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin|arXiv (Cornell University)|Sep 4, 2023

Topic Modeling | Citations: 52

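One-line summary

This paper introduces RGB, a Retrieval-Augmented Generation benchmark in English and Chinese, and uses it to evaluate four abilities of RAG-equipped LLMs across six models. It reveals marked limitations in noise handling, rejection, information integration, and counterfactual robustness.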

ABSTRACT

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

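Motivation and objectives

Evaluate how retrieval augmentation affects LLMs in terms of four core RAG abilities: noise robustness, negative rejection, information integration, and counterfactual robustness.
Build a bilingual English/Chinese benchmark (RGB) from recent news and retrieved documents to enable fair evaluation.
Diagnose the weaknesses of current LLMs that use RAG and point to directions for improvement.
Provide analysis and guidance for the future development of RAG-capable LLMs.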

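Proposed method

Generate QA instances with prompts that produce events, questions, and answers from recent news articles, and build RGB from them.
Retrieve external documents through a search API, split them into text chunks, and apply dense retrieval to select the top-ranked chunks.
Expand the corpus and divide it into four testbeds, one for each of the four RAG abilities.
Evaluate six LLMs on the English and Chinese data: ChatGPT, ChatGLM-6B, ChatGLM2-6B, Vicuna-7B, Qwen-7B-Chat, and BELLE-7B-2M.
Use exact-match accuracy for noise robustness and information integration, rejection signals for negative rejection, and, for counterfactual robustness, accuracy with and without documents together with error detection and correction rates.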
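The two simplest metrics in the list above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes that "exact match" means a gold answer appears verbatim (case-insensitively) in the model output, and that a refusal is recognized by a fixed marker phrase. The function names and the marker string `REJECTION_MARKER` are hypothetical.

```python
def exact_match(prediction: str, answers: list[str]) -> bool:
    """Return True if any gold answer occurs verbatim in the prediction.

    Assumption: matching is case-insensitive substring containment.
    """
    pred = prediction.lower()
    return any(ans.lower() in pred for ans in answers)


def accuracy(predictions: list[str], gold_answers: list[list[str]]) -> float:
    """Fraction of predictions that contain at least one gold answer."""
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, a) for p, a in zip(predictions, gold_answers))
    return hits / len(predictions)


# Hypothetical refusal marker; the actual rejection signal is not given here.
REJECTION_MARKER = "insufficient information"


def rejection_rate(predictions: list[str], marker: str = REJECTION_MARKER) -> float:
    """Fraction of predictions that contain the refusal marker."""
    if not predictions:
        return 0.0
    rejected = sum(marker.lower() in p.lower() for p in predictions)
    return rejected / len(predictions)
```

Substring matching is lenient: an output that mentions the gold answer while also hedging or listing alternatives still counts as correct, so a stricter matcher may be preferable in practice.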

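Experimental results

Research questions

RQ1: How robust to noise are current LLMs when they answer with retrieved documents?
RQ2: Can LLMs correctly decline to answer when the retrieved information is insufficient?
RQ3: How well can LLMs integrate information from multiple retrieved documents?
RQ4: How do LLMs handle counterfactual errors in retrieved documents, and can they detect and correct them?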
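To make the noise setting behind RQ1 and RQ2 concrete, the sketch below assembles a retrieval context with a chosen noise ratio by mixing relevant and irrelevant documents. It is an illustration under assumptions: the function name `build_context`, its parameters, and the sampling scheme are hypothetical and are not taken from the paper.

```python
import random


def build_context(positive_docs, negative_docs, noise_ratio, num_docs, seed=0):
    """Mix relevant and noisy documents into one context.

    noise_ratio is the fraction of the num_docs slots filled with
    irrelevant (noisy) documents; the rest are relevant documents.
    """
    if not 0.0 <= noise_ratio <= 1.0:
        raise ValueError("noise_ratio must be between 0 and 1")
    num_noise = round(num_docs * noise_ratio)
    num_pos = num_docs - num_noise
    if num_pos > len(positive_docs) or num_noise > len(negative_docs):
        raise ValueError("not enough documents for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    docs = rng.sample(positive_docs, num_pos) + rng.sample(negative_docs, num_noise)
    rng.shuffle(docs)  # avoid a fixed position for the relevant documents
    return docs


positives = [f"relevant-{i}" for i in range(5)]
negatives = [f"noise-{i}" for i in range(5)]
context = build_context(positives, negatives, noise_ratio=0.8, num_docs=5)
```

With `noise_ratio=1.0` the context contains no relevant document at all, which corresponds to the insufficient-information situation that RQ2 asks about.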

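Key findings

RAG improves the response accuracy of some models, but performance drops as noise increases (for example, accuracy falls markedly once the noise ratio exceeds 0.8).
Negative rejection remains difficult: the highest rejection rates observed in the evaluation were 45% for English and 43.33% for Chinese, and models often produce an answer from noisy content instead of declining.
Information integration is weak: even without noise, the best accuracy is 60% for English and 67% for Chinese, and it drops further when noise is present.
Models struggle with counterfactual robustness: accuracy without documents is higher than accuracy with counterfactual documents, and error detection and correction rates are limited.
Across languages, models differ in their sensitivity to noise and to inconsistencies between documents, and merging, ignoring, and misalignment errors affect scenarios that involve multiple sub-questions.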
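The error detection and correction rates mentioned in the counterfactual finding can be computed along the following lines. This is a sketch under stated assumptions, not the paper's definition: it assumes a response counts as "detecting" an error when it contains a fixed marker phrase, and as "correcting" it when a flagged response also contains a gold answer. The marker `ERROR_MARKER` and both function names are hypothetical.

```python
# Hypothetical marker a model would emit when it notices a factual error.
ERROR_MARKER = "factual errors"


def error_detection_rate(predictions: list[str], marker: str = ERROR_MARKER) -> float:
    """Fraction of responses that flag an error in the provided documents."""
    if not predictions:
        return 0.0
    flagged = sum(marker.lower() in p.lower() for p in predictions)
    return flagged / len(predictions)


def error_correction_rate(
    predictions: list[str],
    gold_answers: list[list[str]],
    marker: str = ERROR_MARKER,
) -> float:
    """Among flagged responses, fraction that also contain a gold answer.

    Assumption: correction is measured only on responses that detected an error.
    """
    flagged = [
        (p, answers)
        for p, answers in zip(predictions, gold_answers)
        if marker.lower() in p.lower()
    ]
    if not flagged:
        return 0.0
    corrected = sum(
        any(ans.lower() in p.lower() for ans in answers) for p, answers in flagged
    )
    return corrected / len(flagged)
```

Conditioning the correction rate on detected errors separates two failure modes: a model that never notices the false information, and a model that notices it but still cannot recover the right answer.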


This review was generated by AI and checked by a human editor.