QUICK REVIEW

[Paper Review] LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Kaiyu Yang, Aidan Swope|arXiv (Cornell University)|2023. 06. 27.

Mathematics, Computing, and Information Processing|Citations: 38

One-Line Summary

LeanDojo introduces an open-source Lean playground comprising data, models, and benchmarks, and presents ReProver, a retrieval-augmented prover that improves theorem proving by retrieving premises from Lean's math library.

ABSTRACT

Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection: a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): an LLM-based prover augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 98,734 theorems and proofs extracted from Lean's math library. It features challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.

Motivation and Objectives

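Provide open, reproducible tools for extracting data from Lean and interacting with Lean programmatically.
Develop ReProver, a retrieval-augmented prover that selects premises from mathlib and generates tactics conditioned on them.
Build a large, challenging Lean-based benchmark for evaluating premise selection and theorem proving performance.
Show that retrieval augmentation improves proving performance over a non-retrieval baseline and GPT-4 on the LeanDojo Benchmark.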

Proposed Method

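LeanDojo extracts runtime proof data from Lean (states, tactics, premises) and enriches it with premise information, including premise names and which premises are accessible.
ReProver uses a retrieval-augmented tactic generator conditioned on a small number of retrieved premises.
Premise retrieval is based on Dense Passage Retriever (DPR), with two improvements: candidates are restricted to accessible premises, and in-file negative examples are used.
A T5 encoder-decoder is fine-tuned on states plus retrieved premises to generate tactics, and proofs are then found with best-first search.
The LeanDojo Benchmark dataset contains 98,734 theorems/proofs and 130,262 premises, and includes a novel_premises data split for testing generalization.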
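
The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Premise` class, the two-dimensional embeddings, the file names, and the `future_lemma` entry are hypothetical stand-ins, and cosine similarity over fixed vectors takes the place of a trained dense encoder. The sketch also assumes a single ground-truth premise per example. It shows the two ideas named in the list: ranking only premises that are accessible at the current proof state, and drawing hard negatives from the same file.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Premise:
    name: str
    file: str         # file in which the premise is defined (hypothetical paths)
    embedding: tuple  # toy stand-in for a learned dense embedding


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(state_embedding, premises, accessible_names, k):
    """Rank only the premises that are accessible at the current proof state."""
    candidates = [p for p in premises if p.name in accessible_names]
    candidates.sort(key=lambda p: cosine(state_embedding, p.embedding), reverse=True)
    return candidates[:k]


def sample_negatives(positive, premises, current_file, n_in_file, n_random, rng):
    """Mix hard negatives from the current file with random negatives from elsewhere."""
    others = [p for p in premises if p.name != positive.name]
    in_file = [p for p in others if p.file == current_file]
    elsewhere = [p for p in others if p.file != current_file]
    hard = rng.sample(in_file, min(n_in_file, len(in_file)))
    easy = rng.sample(elsewhere, min(n_random, len(elsewhere)))
    return hard + easy


# Toy data: embeddings and file paths are made up for illustration.
premises = [
    Premise("nat.add_comm", "nat/basic.lean", (1.0, 0.0)),
    Premise("nat.mul_comm", "nat/basic.lean", (0.8, 0.6)),
    Premise("list.length_append", "list/basic.lean", (0.0, 1.0)),
    Premise("future_lemma", "nat/later.lean", (1.0, 0.1)),  # not accessible yet
]
state = (1.0, 0.05)
accessible = {"nat.add_comm", "nat.mul_comm", "list.length_append"}

top = retrieve(state, premises, accessible, k=2)
negatives = sample_negatives(premises[0], premises, "nat/basic.lean", 1, 1, random.Random(0))
```

In this toy setup `future_lemma` is close to the state embedding but is never returned, because it is filtered out before ranking. Filtering first shrinks the candidate set, which is the intuition behind restricting retrieval to accessible premises.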

Experimental Results

Research Questions

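RQ1: Does retrieval augmentation with retrieved premises improve interactive theorem proving in Lean over a non-retrieval baseline?
RQ2: How do restricting retrieval to accessible premises and using in-file negative examples affect premise recall and proof success?
RQ3: How do the random and novel_premises splits of the LeanDojo Benchmark affect measured generalization to new premises?
RQ4: How does ReProver perform on the external datasets MiniF2F and ProofNet compared with existing methods that do not rely on heavy RL?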

Key Results

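Table 1. Premise retrieval on the random and novel_premises splits (R@1 and R@10 in %).
Method	random R@1	random R@10	random MRR	novel_premises R@1	novel_premises R@10	novel_premises MRR
BM25	6.7	17.2	0.15	5.9	15.5	0.14
BM25, w/ all premises	1.9	11.9	0.08	2.1	12.4	0.08
Ours	13.5	38.4	0.31	9.1	27.6	0.24
Ours, w/ all premises	11.7	36.2	0.27	7.1	23.1	0.20
Ours, w/o in-file negatives	10.8	33.1	0.25	7.9	25.7	0.22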
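
The R@k and MRR columns can be computed roughly as follows. This is a sketch under assumed definitions, not the authors' evaluation script: R@k is taken to be the fraction of ground-truth premises found in the top k results, averaged over examples and reported as a percentage, and MRR is the mean reciprocal rank of the first ground-truth premise. The example rankings are made up.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth premises that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for name in ranked[:k] if name in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first ground-truth premise, or 0 if none is retrieved."""
    for rank, name in enumerate(ranked, start=1):
        if name in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(examples, k):
    """Average R@k (as a percentage) and MRR over (ranked, relevant) pairs."""
    n = len(examples)
    r_at_k = 100.0 * sum(recall_at_k(r, g, k) for r, g in examples) / n
    mrr = sum(reciprocal_rank(r, g) for r, g in examples) / n
    return r_at_k, mrr


# Made-up rankings: each pair is (retrieved names in rank order, ground-truth set).
examples = [
    (["a", "b", "c", "d"], {"b", "d"}),  # first hit at rank 2
    (["x", "y", "z", "w"], {"x"}),       # first hit at rank 1
]
r_at_1, mrr = evaluate(examples, k=1)
r_at_2, _ = evaluate(examples, k=2)
```

The sketch assumes each ranked list contains no duplicate names; with duplicates, the hit count in `recall_at_k` would need deduplication.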

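On the random split of the LeanDojo Benchmark, ReProver reaches 51.2% Pass@1, outperforming the non-retrieval baseline (47.6%) and GPT-4 (29.0%).
On novel_premises, ReProver reaches 26.3% Pass@1, ahead of the non-retrieval baseline (23.2%) and GPT-4 (7.4%).
Premise retrieval with accessible premises and in-file negative examples substantially improves recall metrics over the baseline (for example, in Table 1, R@1 on the random split is 13.5 for Ours versus 6.7 for BM25).
ReProver proves 26.5% of the MiniF2F test set and 13.8% of ProofNet in Lean, which is competitive with state-of-the-art non-RL methods, and it finds dozens of proofs for theorems that previously had no Lean proof.
Training takes 5 days on a single GPU and evaluation uses 8 GPUs; the code, data, and models are open source and released under the MIT license.
LeanDojo Benchmark is one of the largest math-focused theorem proving datasets, with an emphasis on a challenging data split for generalization.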
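
The Pass@1 figures above come from running proof search with the tactic generator. A minimal sketch of best-first search is given below, assuming that states are prioritized by the cumulative log-probability of the tactics that led to them. The toy environment (integer states, the `dec`, `halve`, and `bad` tactics, and their scores) is hypothetical and stands in for Lean and the fine-tuned model. In a real system, `apply_tactic` would run the tactic in Lean and `generate_tactics` would sample from the model.

```python
import heapq
import itertools


def best_first_search(initial_state, generate_tactics, apply_tactic, is_proved,
                      max_expansions=100):
    """Expand the state with the highest cumulative log-probability first.

    Returns the list of tactics that proves the goal, or None if the
    expansion budget runs out or no states remain.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares states
    # heapq is a min-heap, so store the negated cumulative log-probability.
    frontier = [(0.0, next(counter), initial_state, [])]
    visited = {initial_state}
    for _ in range(max_expansions):
        if not frontier:
            return None
        neg_logp, _tie, state, proof = heapq.heappop(frontier)
        for tactic, logp in generate_tactics(state):
            next_state = apply_tactic(state, tactic)
            if next_state is None or next_state in visited:
                continue  # tactic failed or led to an already-seen state
            new_proof = proof + [tactic]
            if is_proved(next_state):
                return new_proof
            visited.add(next_state)
            heapq.heappush(
                frontier, (neg_logp - logp, next(counter), next_state, new_proof)
            )
    return None


# Toy environment: the "goal" is to reduce an integer to 0.
def generate_tactics(state):
    """Return (tactic, log-probability) pairs; the scores are made up."""
    return [("bad", -0.05), ("dec", -0.1), ("halve", -0.5)]


def apply_tactic(state, tactic):
    """Return the next state, or None if the tactic fails."""
    if tactic == "dec" and state > 0:
        return state - 1
    if tactic == "halve" and state > 0 and state % 2 == 0:
        return state // 2
    return None


proof = best_first_search(4, generate_tactics, apply_tactic, lambda s: s == 0)
```

The `visited` set here is a simplification: it keeps the first path found to each state rather than the best-scoring one, which is enough to show the search order without tracking alternative paths.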

This review was generated by AI and reviewed by a human editor.