QUICK REVIEW

[論文レビュー] LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain

Nicholas Pipitone, Ghita Houir Alami|arXiv (Cornell University)|Aug 19, 2024

Artificial Intelligence in Law被引用数 7

ひとこと要約

LegalBench-RAG は、法域における RAG システムの検索段階の評価に焦点を当てた初のベンチマークであり、原典に基づく正確なスニペットレベルの検索と、迅速な実験のためのミニ版を使用します。

ABSTRACT

Retrieval-Augmented Generation (RAG) systems are showing promising potential, and are becoming increasingly relevant in AI-powered legal applications. Existing benchmarks, such as LegalBench, assess the generative capabilities of Large Language Models (LLMs) in the legal domain, but there is a critical gap in evaluating the retrieval component of RAG systems. To address this, we introduce LegalBench-RAG, the first benchmark specifically designed to evaluate the retrieval step of RAG pipelines within the legal space. LegalBench-RAG emphasizes precise retrieval by focusing on extracting minimal, highly relevant text segments from legal documents. These highly relevant snippets are preferred over retrieving document IDs, or large sequences of imprecise chunks, both of which can exceed context window limitations. Long context windows cost more to process, induce higher latency, and lead LLMs to forget or hallucinate information. Additionally, precise results allow LLMs to generate citations for the end user. The LegalBench-RAG benchmark is constructed by retracing the context used in LegalBench queries back to their original locations within the legal corpus, resulting in a dataset of 6,858 query-answer pairs over a corpus of over 79M characters, entirely human-annotated by legal experts. We also introduce LegalBench-RAG-mini, a lightweight version for rapid iteration and experimentation. By providing a dedicated benchmark for legal retrieval, LegalBench-RAG serves as a critical tool for companies and researchers focused on enhancing the accuracy and performance of RAG systems in the legal domain. The LegalBench-RAG dataset is publicly available at https://github.com/zeroentropy-cc/legalbenchrag.

研究の動機と目的

大規模な法的コーパスの中で正確かつ最小のテキストスニペットを特定することにより、法的 RAG パイプラインの検索精度を評価する。
法的文脈での検索アルゴリズムを比較するための、公開可能で専門家が注釈したデータセットを提供する。
大きく不正確なチャンクを避け、精密な引用を可能にすることで、文脈窓の効率を維持する。

提案手法

LegalBench-RAG を、4つの法的データセット（PrivacyQA, CUAD, MAUD, ContractNLI）内の元の位置に遡る形で構築する。
文書の説明と問いの組み合わせとしてクエリを作成し、グラウンドトゥルーススパン（開始/終了文字位置）を得る。
各クエリに正確に回答するため、(filename, index range) のスパン配列を用いて QA ペアを注釈付けする。
検索を、チャンク化戦略（固定サイズ 500-char チャンク vs Recursive Text Character Splitter）とポスト処理（リランカーなし vs Cohere ランカー）で評価する。
埋め込みには OpenAI text-embedding-3-large を、格納には SQLite Vec を、実験のために選択された構成で Cohere reranker を用いる。

Figure 1: Benchmarking The Retrieval Step Of RAG Systems

実験結果

リサーチクエスチョン

RQ1大規模コーパス内で、法的クエリに回答する正確で最小のテキストスパンを法的検索システムはどれだけうまく特定できるか？
RQ2法的テキストにおいて、どのようなチャンク化およびリランキング戦略が最高の検索の精度と再現率を生むか？
RQ3LegalBench-RAG 内の異なるデータセットは、検索の難易度と性能の点でどのように比較されるか？
RQ4特化した法的テキストに対する汎用リランカーの限界は何か？

主な発見

RTCS (Recursive Text Character Splitter) は、リランカーなしでもデータセット全体で最高の検索性能を達成した。
Cohere’s reranker はこの法的検索ベンチマークにおいて、リランカーなしと比較して性能が低かった。
PrivacyQA は最も取り組みやすいデータセットとして浮上し、MAUD は検索において最も難しいデータセットだった。
PrivacyQA では、RTCS でリランカーなしの場合、Precision@1 が 14.38%、Recall@64 が 84.19% に達した。
MAUD は低い性能を示し、Precision@1 は約 2.65%、Recall@64 は約 28.28% だった。
LegalBench-RAG-mini は、迅速な実験のための軽量な 776 クエリのサブセット（4データセットに跨る）を提供する。）

Figure 2: Recall @ $k$ across all datasets

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。