QUICK REVIEW

[Paper Review] Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

Shihua Zhao, Yuqing Yang|arXiv (Cornell University)|Sep 23, 2024

Library Science and Information Systems | Citations: 12

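ONE-LINE SUMMARY

A comprehensive survey that classifies queries to data-augmented LLMs into four levels and discusses RAG enhancements, retrieval strategies, and forms of external data integration.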

ABSTRACT

Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks. Techniques for integrating external data into LLMs, such as Retrieval-Augmented Generation (RAG) and fine-tuning, are gaining increasing attention and widespread application. Nonetheless, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges. These challenges encompass a wide range of issues, from retrieving relevant data and accurately interpreting user intent to fully harnessing the reasoning capabilities of LLMs for complex tasks. We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. In this survey, we propose a RAG task categorization method, classifying user queries into four levels based on the type of external data required and primary focus of the task: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. We define these levels of queries, provide relevant datasets, and summarize the key challenges and most effective techniques for addressing these challenges. Finally, we discuss three main forms of integrating external data into LLMs: context, small model, and fine-tuning, highlighting their respective strengths, limitations, and the types of problems they are suited to solve. This work aims to help readers thoroughly understand and decompose the data requirements and key bottlenecks in building LLM applications, offering solutions to the different challenges and serving as a guide to systematically developing such applications.

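RESEARCH MOTIVATION AND OBJECTIVES

Define a structured view of data-augmented LLM applications and explain why external data improves LLM performance.
Propose a four-level query classification for data-augmented tasks (explicit facts, implicit facts, interpretable rationales, hidden rationales).
Survey the challenges, datasets, and effective techniques for RAG and alternative approaches.
Discuss the three main forms of external data integration (context, small model, and fine-tuning) and their respective trade-offs.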
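
The four query levels and three integration forms listed above can be written down as a small data structure. The following is a minimal sketch, not the survey's own notation: the names `QueryLevel`, `IntegrationForm`, and `is_rationale_query` are hypothetical and introduced only for illustration. It assumes only what the abstract states, namely that the levels are ordered and that the first two concern facts while the last two concern rationales.

```python
from enum import Enum, IntEnum


class QueryLevel(IntEnum):
    """The four query levels named in the survey, in the order it lists them."""
    EXPLICIT_FACT = 1
    IMPLICIT_FACT = 2
    INTERPRETABLE_RATIONALE = 3
    HIDDEN_RATIONALE = 4


class IntegrationForm(Enum):
    """The three forms of external data integration named in the survey."""
    CONTEXT = "context"
    SMALL_MODEL = "small model"
    FINE_TUNING = "fine-tuning"


def is_rationale_query(level: QueryLevel) -> bool:
    """Hypothetical helper: True for the two rationale levels, False for the fact levels."""
    return level >= QueryLevel.INTERPRETABLE_RATIONALE


# List each level with its fact/rationale grouping.
for level in QueryLevel:
    kind = "rationale" if is_rationale_query(level) else "fact"
    print(f"L{level.value}: {level.name.lower()} ({kind} query)")
```

Using an ordered enum keeps the level comparison explicit, which is convenient if a routing step later needs to treat higher levels differently from lower ones.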

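PROPOSED METHOD

Presents a formal problem definition of a data-augmented LLM application as a mapping f: Q -> A, given external data D.
Classifies queries into four levels and maps datasets to each level (see Table 1 in the paper).
Details the components of RAG: data processing, data retrieval (sparse, dense, hybrid), document/query alignment, re-ranking, and iterative retrieval.
Describes enhancements to response generation, including handling noisy retrieval results and jointly retraining the retriever and generator.
Introduces iterative, graph/tree-based, and SQL-based approaches for higher-level (implicit) fact queries.
Discusses alternative data integration strategies beyond RAG, including knowledge graphs, graph-based reasoning, and chunk-based prompting.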
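
To make the sparse, dense, and hybrid retrieval distinction above concrete, here is a minimal sketch of hybrid retrieval. It rests on several assumptions that are not taken from the survey: the sparse scorer is a toy IDF-weighted term match, the "dense" scorer is a character-trigram vector standing in for a learned encoder, and the two rankings are merged with reciprocal rank fusion, which is one common fusion choice rather than the method the survey prescribes. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def sparse_scores(query, docs):
    """Toy sparse scorer: sum of IDF weights of the query terms found in each doc."""
    doc_tokens = [set(tokenize(d)) for d in docs]
    n = len(docs)
    scores = []
    for tokens in doc_tokens:
        score = 0.0
        for term in set(tokenize(query)):
            if term in tokens:
                df = sum(1 for t in doc_tokens if term in t)
                score += math.log(1 + n / df)
        scores.append(score)
    return scores


def embed(text):
    """Stand-in for a learned encoder (assumption): character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dense_scores(query, docs):
    q = embed(query)
    return [cosine(q, embed(d)) for d in docs]


def ranking(scores):
    """Document indices sorted by descending score, ties broken by index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


def hybrid_retrieve(query, docs, k=2, rrf_k=60):
    """Fuse the sparse and dense rankings with reciprocal rank fusion."""
    fused = [0.0] * len(docs)
    for scores in (sparse_scores(query, docs), dense_scores(query, docs)):
        for rank, idx in enumerate(ranking(scores), start=1):
            fused[idx] += 1.0 / (rrf_k + rank)
    return [docs[i] for i in ranking(fused)[:k]]


docs = [
    "sparse retrieval uses term matching",
    "dense retrieval uses embeddings",
    "bananas are yellow",
]
top = hybrid_retrieve("dense embeddings retrieval", docs, k=2)
print(top)
```

In a fuller pipeline, a re-ranking step would typically rescore the fused candidates before they are passed to the generator; that step is omitted here to keep the sketch short.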

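EXPERIMENTAL RESULTS

Research Questions

RQ1: How can user queries be stratified into levels based on their need for external data and the focus of the task?
RQ2: Across the four levels, what are the main challenges and effective solutions when LLMs retrieve and use external data?
RQ3: What are the strengths and limitations of the context, small-model, and fine-tuning approaches?
RQ4: Which datasets exemplify the explicit/implicit fact and rationale query levels, and how do they map to existing tasks?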

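Key Findings

RAG remains the core solution for explicit fact queries, although it involves data-processing and retrieval challenges across unstructured and multimodal data.
Iterative and hierarchical retrieval strategies help address multi-hop and complex implicit fact queries.
The three forms of data integration (context, small model, fine-tuning) offer distinct trade-offs in controllability, efficiency, and domain adaptation.
Alignment strategies (traditional, document-domain, query-domain) and re-ranking are important for retrieval quality, and methods such as HyDE and SlimPLM contribute improvements.
Handling noisy retrieval through fine-tuning and joint retraining can stabilize generation and reduce hallucinations in data-augmented LLMs.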
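
The iterative retrieval idea mentioned in the findings above can be sketched as a loop that alternates between retrieving evidence and proposing a follow-up query. This is a sketch under stated assumptions, not an implementation from the survey: `iterative_retrieve`, `toy_retrieve`, and `toy_followup` are hypothetical names, the two-passage corpus is made up for illustration, and the follow-up step (which a real system would likely delegate to an LLM) is replaced by a crude capitalized-word heuristic.

```python
def iterative_retrieve(question, retrieve, propose_followup, max_hops=3):
    """Collect evidence over several retrieval rounds.

    retrieve(query) returns a list of passages.
    propose_followup(question, evidence) returns the next query, or None to stop.
    """
    evidence = []
    query = question
    for _ in range(max_hops):
        new_passages = [p for p in retrieve(query) if p not in evidence]
        if not new_passages:
            break  # nothing new was found, so further hops would not help
        evidence.extend(new_passages)
        query = propose_followup(question, evidence)
        if query is None:
            break
    return evidence


# Toy corpus, made up for illustration: answering needs two hops.
CORPUS = {
    "alice": "Alice works at Acme.",
    "acme": "Acme is based in Oslo.",
}


def toy_retrieve(query):
    """Return passages whose key occurs in the lowercased query."""
    q = query.lower()
    return [passage for key, passage in CORPUS.items() if key in q]


def toy_followup(question, evidence):
    """Crude stand-in for an LLM: pick a capitalized word not in the question."""
    seen = question.lower()
    for passage in evidence:
        for word in passage.rstrip(".").split():
            if word[0].isupper() and word.lower() not in seen:
                return word
    return None


evidence = iterative_retrieve(
    "Where is the company that Alice works at based?",
    toy_retrieve,
    toy_followup,
)
print(evidence)
```

The loop stops when a hop returns no unseen passage, when the follow-up step returns None, or when `max_hops` is reached, whichever comes first; bounding the number of hops is a design choice made here to keep cost predictable.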

This review was generated by AI and checked by a human editor.