QUICK REVIEW

[Paper Review] LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neel Guha, Julian Nyarko|arXiv (Cornell University)|Aug 20, 2023

Artificial Intelligence in Law | Cited by 28

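One-Sentence Summary

LegalBench is a collaboratively built, open-source benchmark of 162 legal reasoning tasks spanning six reasoning types for evaluating large language models (LLMs); it was constructed through an interdisciplinary process and comes with a preliminary empirical evaluation of 20 models.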

ABSTRACT

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

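Research Motivation and Goals

Motivate the need for rigorous, domain-aligned benchmarks of legal reasoning in LLMs.
Propose a typology of legal reasoning grounded in IRAC and legal practice.
Describe how LegalBench was constructed, documented, and built collaboratively.
Provide a preliminary empirical evaluation across multiple task types and prompts.
Offer a platform that fosters interdisciplinary research and practical applications in legal AI.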

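Proposed Method

Introduce a six-type typology of legal reasoning (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding).
Assemble 162 tasks from 36 data sources, including datasets hand-crafted by legal professionals and restructured versions of existing corpora.
Organize tasks with documentation, base prompts, and evaluation protocols to support reproducibility.
Evaluate 20 LLMs of varying sizes from 11 families using standardized prompts and prompt-engineering strategies.
Provide answer guides and a multi-faceted evaluation (correctness and analysis) for rule-application tasks.
Discuss limitations, interoperability with IRAC, and implications for policy, safety, and future work.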
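
To make the idea of a base prompt plus an evaluation protocol concrete, here is a minimal sketch of how a few-shot prompt might be assembled and how answers might be scored. Everything in it is an assumption for illustration: the `Text:`/`Label:` template, the names `Example`, `build_prompt`, and `exact_match_accuracy`, and the sample clauses are invented here and are not taken from LegalBench or its code.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labeled demonstration (hypothetical structure, not LegalBench's)."""
    text: str
    label: str


def build_prompt(instruction: str, demos: list[Example], query: str) -> str:
    """Assemble instruction + labeled demonstrations + the unanswered query."""
    parts = [instruction.strip(), ""]
    for demo in demos:
        parts.append(f"Text: {demo.text}")
        parts.append(f"Label: {demo.label}")
        parts.append("")
    parts.append(f"Text: {query}")
    parts.append("Label:")  # the model is expected to continue from here
    return "\n".join(parts)


def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions equal to the gold label, ignoring case and
    surrounding whitespace. A simple stand-in for a real scoring protocol."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold))
    return hits / len(gold)


# Invented demonstrations for a made-up clause-classification task.
demos = [
    Example("The Recipient shall keep all Disclosed Information secret.", "Yes"),
    Example("This Agreement is governed by the laws of the State of X.", "No"),
]
prompt = build_prompt(
    "Does the clause impose a confidentiality obligation? Answer Yes or No.",
    demos,
    "Each party shall hold the other's information in strict confidence.",
)
```

Keeping the instruction, demonstrations, and scoring function fixed across models is what makes results comparable; only the model call, which is omitted here, changes between runs.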

Figure 1: We compare performance of prompts which describe the legal rule to be applied (“description”) against prompts which reference the legal rule to be applied (“reference”). Error bars measure standard error, computed using a bootstrap with 1000 resamples.
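
The error bars in the caption are bootstrap standard errors computed with 1000 resamples. The sketch below shows one common way to compute such a quantity for a mean score; it is an illustration under assumptions, not the authors' code. The function name `bootstrap_standard_error`, the fixed seed, and the choice to resample per-example 0/1 correctness scores are all hypothetical.

```python
import random
import statistics


def bootstrap_standard_error(scores, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean of `scores` by bootstrapping.

    Each resample draws len(scores) items with replacement; the standard
    deviation of the resampled means is the bootstrap standard error.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if n_resamples < 2:
        raise ValueError("need at least two resamples")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        sample = rng.choices(scores, k=n)  # sampling with replacement
        resampled_means.append(sum(sample) / n)
    return statistics.stdev(resampled_means)


# Hypothetical per-example correctness (1 = correct, 0 = incorrect).
scores = [1] * 50 + [0] * 50
se = bootstrap_standard_error(scores)
```

For 100 examples at 50% accuracy, the result should land near the analytic value sqrt(0.5 * 0.5 / 100) = 0.05, which is a useful sanity check on any bootstrap implementation.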

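Experimental Results

Research Questions

RQ1: What types of legal reasoning can LLMs perform, and how can they be measured in a fine-grained, domain-aligned benchmark?
RQ2: How can a collaborative process driven by domain experts make LLM evaluation in the legal domain more relevant and more useful in practice?
RQ3: How do different LLMs perform across a detailed typology of legal tasks and across prompting strategies?
RQ4: To what extent can LegalBench tasks be extended to non-U.S. jurisdictions and to longer documents?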

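Key Findings

LegalBench provides 162 tasks covering six reasoning types drawn from legal frameworks and legal practice.
The benchmark supplies standardized prompts, example demonstrations, and evaluation protocols for studying LLM performance in legal contexts.
Preliminary experiments on 20 LLMs show that strengths vary across task types and yield insights into prompt-engineering strategies (details in the paper).
LegalBench highlights the importance of domain-expert input in task construction for keeping evaluations practically useful and interpretable.
The benchmark deliberately emphasizes interpretation and contract-related tasks, because their legal language is pervasive and they have practical impact.
The authors discuss limitations (such as the focus on English and U.S. law, and short context windows) and outline directions for future extensions.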

Figure 2: We compare performance of prompts which describe the task in plain language to prompts which describe the task in technical legal language (for GPT-3.5). Error bars measure standard error, computed using a bootstrap with 1000 resamples.


This review was generated by AI and reviewed by a human editor.