QUICK REVIEW

[Paper Review] LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neel Guha, Julian Nyarko|arXiv (Cornell University)|Aug 20, 2023

Artificial Intelligence in Law | Citations: 28

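One-Line Summary

LegalBench is a collaboratively built, open-source benchmark that evaluates LLMs on 162 legal reasoning tasks spanning six reasoning types. The paper describes its interdisciplinary construction process and an initial empirical evaluation of 20 models.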

ABSTRACT

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

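Research Motivation and Objectives

Motivates the need for a rigorous, domain-aligned benchmark for legal reasoning in LLMs.
Presents a typology of legal reasoning grounded in IRAC and legal practice.
Describes how LegalBench was constructed and documented, and the collaborative process behind it.
Provides an initial empirical evaluation of multiple LLMs across diverse task types and prompts.
Offers a platform that enables further interdisciplinary research and practical applications in legal AI.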

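Proposed Method

Introduces a six-type legal reasoning typology (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding).
Assembles 162 tasks from 36 data sources, including datasets hand-crafted by legal professionals and restructured existing corpora.
Organizes each task with documentation, base prompts, and an evaluation protocol to support reproducibility.
Evaluates 20 LLMs from 11 families, spanning a range of sizes, using standardized prompts and prompt engineering strategies.
Provides answer guides and a multi-faceted evaluation (correctness and analysis) for rule-application tasks.
Discusses limitations, interoperability with IRAC, and implications for policy, safety, and future work.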

Figure 1: We compare performance of prompts which describe the legal rule to be applied (“description”) against prompts which reference the legal rule to be applied (“reference”). Error bars measure standard error, computed using a bootstrap with 1000 resamples.
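The caption above reports standard errors computed with a bootstrap of 1000 resamples. As a rough illustration of that general technique (not the authors' actual evaluation code), the sketch below resamples per-example correctness with replacement and takes the standard deviation of the resampled accuracies. The function name `bootstrap_standard_error`, the fixed seed, and the `outcomes` data are all hypothetical and chosen only for this example.

```python
import random
import statistics


def bootstrap_standard_error(correct, n_resamples=1000, seed=0):
    """Estimate the standard error of mean accuracy via the bootstrap.

    `correct` is a list of 0/1 outcomes, one per evaluated example.
    This helper is a hypothetical illustration, not part of LegalBench.
    """
    if not correct:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct)
    resampled_means = []
    for _ in range(n_resamples):
        # Draw n outcomes with replacement and record their mean accuracy.
        sample = rng.choices(correct, k=n)
        resampled_means.append(sum(sample) / n)
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(resampled_means)


# Made-up outcomes for illustration: 70 correct answers out of 100.
outcomes = [1] * 70 + [0] * 30
accuracy = sum(outcomes) / len(outcomes)
se = bootstrap_standard_error(outcomes)
print(f"accuracy = {accuracy:.3f} +/- {se:.3f}")
```

For a binary outcome the result should land close to the analytic value sqrt(p(1 - p)/n), which gives a quick sanity check on the resampling loop.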

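Experimental Results

Research Questions

RQ1: What kinds of legal reasoning can LLMs perform, and how can it be measured with a fine-grained, domain-aligned benchmark?
RQ2: How does a collaborative process led by domain experts improve the relevance and usefulness of LLM evaluation in the legal field?
RQ3: How do different LLMs perform across a detailed typology of legal tasks and prompting strategies?
RQ4: To what extent can LegalBench tasks be extended to jurisdictions outside the United States and to long documents?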

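Key Findings

LegalBench provides 162 tasks covering six reasoning types, drawn from legal frameworks and legal practice.
The benchmark supplies standardized prompts, demonstrations, and evaluation protocols for studying LLM performance in legal contexts.
Initial experiments across 20 LLMs show that strengths differ by task type and yield insights into prompt engineering strategies (see the paper for details).
LegalBench underscores the importance of domain-expert input in task construction for ensuring practical, interpretable evaluation.
Interpretation and contract-related tasks are deliberately emphasized, given how pervasive legal language is and how broad its practical impact can be.
The authors discuss limitations (e.g., the focus on English and U.S. law, and short context windows) and outline directions for future extensions.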

Figure 2: We compare performance of prompts which describe the task in plain language to prompts which describe the task in technical legal language (for GPT-3.5). Error bars measure standard error, computed using a bootstrap with 1000 resamples.

This review was generated by AI and checked by a human editor.