QUICK REVIEW

[Paper Review] LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

Neel Guha, Julian Nyarko | arXiv (Cornell University) | 2023-08-20

Artificial Intelligence in Law | Citations: 28

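One-Line Summary

LegalBench is a collaboratively built, open-source benchmark of 162 legal reasoning tasks spanning six reasoning types, designed for evaluating LLMs. The paper describes its interdisciplinary construction process and an initial empirical evaluation of 20 models.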

ABSTRACT

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a collaboratively constructed legal reasoning benchmark consisting of 162 tasks covering six different types of legal reasoning. LegalBench was built through an interdisciplinary process, in which we collected tasks designed and hand-crafted by legal professionals. Because these subject matter experts took a leading role in construction, tasks either measure legal reasoning capabilities that are practically useful, or measure reasoning skills that lawyers find interesting. To enable cross-disciplinary conversations about LLMs in the law, we additionally show how popular legal frameworks for describing legal reasoning -- which distinguish between its many forms -- correspond to LegalBench tasks, thus giving lawyers and LLM developers a common vocabulary. This paper describes LegalBench, presents an empirical evaluation of 20 open-source and commercial LLMs, and illustrates the types of research explorations LegalBench enables.

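Motivation and Objectives

Argues for the need for a rigorous, domain-aligned benchmark of legal reasoning in LLMs.
Presents a typology of legal reasoning grounded in IRAC and legal practice.
Describes the construction, documentation, and collaborative process behind LegalBench.
Provides an initial empirical evaluation of multiple LLMs across diverse task types and prompts.
Offers a platform that enables further interdisciplinary research and practical applications in legal AI.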

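Proposed Method

Introduces a typology of six legal reasoning types (issue-spotting, rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-understanding).
Gathers 162 tasks from 36 data sources, including datasets hand-crafted by legal professionals and restructured existing corpora.
Organizes each task with documentation, a base prompt, and an evaluation protocol to improve reproducibility.
Evaluates 20 LLMs of varying sizes from 11 model families, using standardized prompts and prompt engineering strategies.
Provides answer guides for rule-application tasks and a multi-faceted evaluation covering both correctness and analysis.
Discusses limitations, interoperability with IRAC, and implications for policy, safety, and future research.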
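
As a rough illustration of how a task might bundle a reasoning type, a base prompt, and an evaluation protocol, here is a minimal Python sketch. The `Task` class, `build_prompt`, `accuracy`, the toy clause task, and the keyword "model" are all hypothetical names invented for this example; they are not the benchmark's actual code or data, and exact-match scoring is only one assumed protocol.

```python
from dataclasses import dataclass

# The six reasoning types named in the review.
REASONING_TYPES = {
    "issue-spotting", "rule-recall", "rule-application",
    "rule-conclusion", "interpretation", "rhetorical-understanding",
}

@dataclass
class Task:
    """Hypothetical container for one benchmark task (not the real schema)."""
    name: str
    reasoning_type: str
    base_prompt: str   # template containing a {text} placeholder
    examples: list     # list of (text, gold_label) pairs

    def __post_init__(self):
        if self.reasoning_type not in REASONING_TYPES:
            raise ValueError(f"unknown reasoning type: {self.reasoning_type}")

def build_prompt(task, text):
    """Fill the task's base prompt with one input text."""
    return task.base_prompt.format(text=text)

def accuracy(task, predict):
    """Exact-match accuracy of `predict` (prompt -> answer string) on a task."""
    hits = sum(
        predict(build_prompt(task, text)).strip().lower() == gold.lower()
        for text, gold in task.examples
    )
    return hits / len(task.examples)

# Toy task and toy "model", both invented for illustration only.
toy_task = Task(
    name="toy_governing_law",
    reasoning_type="interpretation",
    base_prompt=(
        "Does the clause specify which law applies? Answer Yes or No.\n\n"
        "Clause: {text}\nAnswer:"
    ),
    examples=[
        ("This Agreement is governed by the laws of the State of New York.", "Yes"),
        ("Payment is due within 30 days of invoice.", "No"),
    ],
)

def toy_model(prompt):
    # Keyword heuristic standing in for an LLM call.
    return "Yes" if "governed by" in prompt else "No"

print(accuracy(toy_task, toy_model))
```

Swapping `toy_model` for a function that calls a real model would leave the scoring loop unchanged, which is the main point of pairing every task with a fixed prompt and protocol.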

Figure 1: We compare performance of prompts which describe the legal rule to be applied (“description”) against prompts which reference the legal rule to be applied (“reference”). Error bars measure standard error, computed using a bootstrap with 1000 resamples.
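
The error bars described in the caption are bootstrap standard errors over 1000 resamples. A minimal sketch of that computation follows, assuming per-example correctness is stored as 0/1 values. The function name, the seed, and the sample outcomes are illustrative assumptions, not the authors' code or data.

```python
import random
import statistics

def bootstrap_standard_error(correct, n_resamples=1000, seed=0):
    """Estimate the standard error of mean accuracy via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample n outcomes with replacement and record the mean accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)

# Hypothetical per-example outcomes (1 = correct, 0 = incorrect); not paper data.
outcomes = [1] * 70 + [0] * 30
accuracy = sum(outcomes) / len(outcomes)
se = bootstrap_standard_error(outcomes)
print(f"accuracy = {accuracy:.2f}, bootstrap SE = {se:.3f}")
```

For a binary outcome the result should land near the analytic value sqrt(p(1-p)/n), which is a useful sanity check on the resampling loop.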

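Experimental Results

Research Questions

RQ1: What types of legal reasoning can LLMs perform, and how can they be measured on a fine-grained, domain-aligned benchmark?
RQ2: How can a collaborative, domain-expert-led process improve the relevance and usability of LLM evaluation in law?
RQ3: How do different LLMs perform across fine-grained legal task types and prompting strategies?
RQ4: To what extent can LegalBench tasks be extended to non-U.S. jurisdictions and longer documents?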

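Key Findings

LegalBench provides 162 tasks spanning six reasoning types derived from legal frameworks and legal practice.
The benchmark makes possible standardized prompts, demonstrations, and evaluation protocols for studying LLM performance in legal contexts.
Initial experiments on 20 LLMs show differing strengths across task types and yield insights into prompt engineering strategies (details in the paper).
LegalBench highlights the importance of domain-expert input in task construction for ensuring practically useful and interpretable evaluation.
It deliberately emphasizes interpretation and contract-related tasks because of their widely used legal language and practical implications.
The authors discuss limitations (e.g., the focus on English and U.S. law, and short context windows) and outline directions for future extension.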

Figure 2: We compare performance of prompts which describe the task in plain language to prompts which describe the task in technical legal language (for GPT-3.5). Error bars measure standard error, computed using a bootstrap with 1000 resamples.

This review was generated by AI and reviewed by a human editor.