QUICK REVIEW

[Paper Review] The Chronicles of RAG: The Retriever, the Chunk and the Generator

Paulo Finardi, Leonardo Avila|arXiv (Cornell University)|Jan 15, 2024

Topic Modeling | Cited by 15

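One-line summary

This paper studies how to design, optimize, and evaluate a Retrieval Augmented Generation (RAG) pipeline for Brazilian Portuguese. It compares several retrieval methods (sparse, dense, hybrid) and rerankers, and reports a best-practice end-to-end configuration that achieves a high relative score on a Harry Potter dataset.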

ABSTRACT

Retrieval Augmented Generation (RAG) has become one of the most popular paradigms for enabling LLMs to access external data, and also as a mechanism for grounding to mitigate against hallucinations. When implementing RAG you can face several challenges like effective integration of retrieval models, efficient representation learning, data diversity, computational efficiency optimization, evaluation, and quality of text generation. Given all these challenges, every day a new technique to improve RAG appears, making it unfeasible to experiment with all combinations for your problem. In this context, this paper presents good practices to implement, optimize, and evaluate RAG for the Brazilian Portuguese language, focusing on the establishment of a simple pipeline for inference and experiments. We explored a diverse set of methods to answer questions about the first Harry Potter book. To generate the answers we used the OpenAI's gpt-4, gpt-4-1106-preview, gpt-3.5-turbo-1106, and Google's Gemini Pro. Focusing on the quality of the retriever, our approach achieved an improvement of MRR@10 by 35.4% compared to the baseline. When optimizing the input size in the application, we observed that it is possible to further enhance it by 2.4%. Finally, we present the complete architecture of the RAG with our recommendations. As result, we moved from a baseline of 57.88% to a maximum relative score of 98.61%.

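Motivation and objectives

Build a Brazilian Portuguese dataset and propose a methodology for quantifying each step of RAG.
Introduce a metric (the relative maximum score) that quantifies the gap to a perfect RAG system.
Evaluate a range of retrievers and pipeline architectures to establish RAG best practices for Portuguese.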

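Proposed method

Build a 140-question QA dataset from the Brazilian Portuguese edition of Harry Potter, using 1000-token chunks.
Define the relative maximum score to measure the potential performance of a RAG configuration.
Evaluate a baseline with no background context, long-context prompting, and naive RAG across several retrievers, including BM25, dense ADA-002, and a custom ADA-002, as well as hybrid and multi-stage reranking setups.
Compare retrieval methods using recall-based metrics (R@k) and MRR@k.
Implement a multi-stage retrieve-and-rerank architecture with BM25 as the first stage and an mt5-based reranker as the second stage.
Experiment with the input size (number of retrieved chunks) and the position of the answer within the prompt to study context effects.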
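The retrieve-and-rerank step listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the BM25 scorer is a plain-Python Okapi BM25 with commonly used default parameters, `overlap_reranker` is a hypothetical stand-in for the mt5-based reranker, and the toy chunks are made up rather than taken from the paper's dataset.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def tokenize(text: str) -> List[str]:
    # Naive whitespace tokenizer; a real system would use a proper tokenizer.
    return text.lower().split()


def bm25_scores(query: str, chunks: Sequence[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every chunk for the query (first stage)."""
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of chunks containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_and_rerank(query: str, chunks: Sequence[str],
                        rerank_score: Callable[[str, str], float],
                        first_stage_k: int = 10, final_k: int = 3) -> List[str]:
    """Keep the BM25 top first_stage_k, then reorder them with the second stage."""
    first = bm25_scores(query, chunks)
    candidates = sorted(range(len(chunks)), key=lambda i: first[i],
                        reverse=True)[:first_stage_k]
    reranked = sorted(candidates, key=lambda i: rerank_score(query, chunks[i]),
                      reverse=True)
    return [chunks[i] for i in reranked[:final_k]]


def overlap_reranker(query: str, chunk: str) -> float:
    # Hypothetical placeholder for a neural reranker:
    # fraction of distinct query tokens that occur in the chunk.
    q = set(tokenize(query))
    return len(q & set(tokenize(chunk))) / max(len(q), 1)


# Made-up toy chunks, for illustration only.
chunks = [
    "harry recebeu uma carta de hogwarts",
    "o tio válter gostava de furadeiras",
    "hagrid entregou a carta a harry na cabana",
    "a coruja dormia na gaiola",
]
top = retrieve_and_rerank("quem entregou a carta a harry", chunks,
                          overlap_reranker, first_stage_k=3, final_k=2)
print(top)
```

Passing the second stage in as a callable keeps the cheap lexical stage and the expensive reranker independent, so a neural scorer could be swapped in without touching the first stage.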

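Experimental results

Research questions

RQ1: How do different retrievers (sparse, dense, hybrid) affect RAG performance in Brazilian Portuguese?
RQ2: How do the chunking strategy and answer placement affect RAG quality?
RQ3: In this setting, can a multi-stage retrieve-and-rerank pipeline outperform the baseline and naive RAG configurations?
RQ4: How does input size relate to RAG performance, and what is the optimal number of chunks to retrieve?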

Key findings

Metric	ADA-002	Custom ADA-002	Hybrid-BM25-ADA-002	Hybrid-BM25-Custom ADA-002	BM25	BM25 + Reranker
MRR@10	0.565	0.665	0.758	0.850	0.879	0.919
R@3	0.628	0.735	0.829	0.921	0.914	0.971
R@5	0.692	0.835	0.879	0.943	0.971	0.985
R@7	0.750	0.871	0.921	0.964	0.985	0.992
R@9	0.814	0.921	0.957	0.979	0.985	1.000
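For readers less familiar with the metrics in the table, here is a minimal sketch of how R@k and MRR@k are commonly computed. It assumes each question has exactly one gold chunk, which is an assumption of this sketch; the function names and toy rankings are illustrative and not taken from the paper.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mrr_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Mean of 1/rank of the gold chunk; a question scores 0 if it is outside the top k."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        top = list(ranked[:k])
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)


# Toy example: three questions, each with a ranked list of chunk ids.
rankings = [["c1", "c2", "c3"], ["c4", "c5", "c6"], ["c7", "c8", "c9"]]
gold = ["c1", "c5", "c0"]  # gold chunk at rank 1, at rank 2, and not retrieved

print(recall_at_k(rankings, gold, 1))
print(recall_at_k(rankings, gold, 3))
print(mrr_at_k(rankings, gold, 3))
```

R@k only asks whether the gold chunk was retrieved at all, while MRR@k also rewards placing it near the top, which is why the two can rank retrievers differently.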

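Dense and hybrid retrievers substantially improve RAG performance over the naive baseline. On retrieval quality, MRR@10 rises from 0.565 (ADA-002) to 0.919 (BM25 + Reranker) in the table above, a 0.354 gap that matches the 35.4% improvement over the baseline reported in the abstract.
Optimizing the input size yields a further gain of about 2.4%.
The best results come from a multi-stage configuration that uses BM25 as the first stage and a neural (mt5-based) reranker as the second, retrieving three chunks; in this setting it approaches the state of the art.
The final best configuration reaches a relative maximum score of 98.61%, a large improvement over the 57.88% baseline (40.73 points on the degradation score).
The study shows a strong link between retriever quality and overall RAG performance, and stresses the importance of careful input and output design and of an evaluation framework.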
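Because the headline numbers mix percentages and point differences, a small arithmetic check may help. The sketch below assumes that ADA-002 is the retrieval baseline (inferred from the table, not stated in this review) and takes the 57.88% and 98.61% scores from the abstract; `gains` is a hypothetical helper.

```python
from typing import Tuple


def gains(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute difference, relative improvement in percent)."""
    absolute = improved - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# MRR@10: ADA-002 (assumed baseline) vs. BM25 + Reranker, from the table.
mrr_abs, mrr_rel = gains(0.565, 0.919)
# Relative maximum score: baseline vs. best configuration, from the abstract.
score_abs, score_rel = gains(57.88, 98.61)

print(round(mrr_abs, 3), round(mrr_rel, 1))
print(round(score_abs, 2), round(score_rel, 1))
```

Under these assumptions, both reported gains (35.4 and 40.73) correspond to absolute differences in points rather than relative percentage changes, which would be larger.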

This review was generated by AI and checked by a human editor.