QUICK REVIEW

[Paper Review] MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics

Xinyu Liu, Zixuan Xie | arXiv (Cornell University) | Jan 30, 2026

Mathematics, Computing, and Information Processing | Cited by 0

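One-Sentence Summary

The paper presents MathlibLemma, a four-agent LLM framework that automatically discovers, formalizes, and proves folklore lemmas missing from Mathlib, and introduces the MathlibLemma benchmark of 4,028 tasks together with a verified library of 1,812 proofs.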

ABSTRACT

While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like LaTeX or Maple. To address this, we introduce MathlibLemma, the first LLM-based multi-agent system to automate the discovery and formalization of mathematical folklore lemmas. This framework constitutes our primary contribution, proactively mining the missing connective tissue of mathematics. Its efficacy is demonstrated by the production of a verified library of folklore lemmas, a subset of which has already been formally merged into the latest build of Mathlib, thereby validating the system's real-world utility and alignment with expert standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work establishes a constructive methodology for the self-evolution of formal mathematical libraries.

Motivation and Goals

Identify and formalize missing folklore lemmas in Mathlib to reduce last-mile gaps in formalization workflows.
Propose a scalable, multi-agent LLM pipeline (Discovery, Judge, Formalizer, Prover) that yields syntactically valid and semantically sound Lean lemmas with proofs.
Create a large, type-checked folklore lemma benchmark to evaluate and guide LLM-based formal reasoning systems.
Provide a verified library of folklore proofs and demonstrate partial upstreaming into Mathlib to validate real-world utility.
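
To make the distinction between a benchmark entry and a library entry concrete, here is a small Lean sketch. It is illustrative only: the statement and theorem names are hypothetical and are not taken from the paper, and the assumption that a benchmark task is a type-checked statement with its proof left open is a reading of the summary, not a confirmed file format.

```lean
import Mathlib

-- Benchmark-style entry (assumed shape): the statement type-checks,
-- but the proof is left open for a prover to fill in.
theorem two_mul_le_sq_add_sq_task (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  sorry

-- Library-style entry: the same statement with a proof the Lean kernel accepts.
-- The hint (a - b)^2 ≥ 0 expands to a^2 - 2ab + b^2 ≥ 0, which gives the goal.
theorem two_mul_le_sq_add_sq_proved (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  nlinarith [sq_nonneg (a - b)]
```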

Proposed Method

Discovery, the first stage of a four-agent pipeline, generates candidate Lean statements from Mathlib seeds.
Judge filters candidates for mathematical correctness using an LLM-based verdict.
Formalizer fixes syntax and type errors by interacting with a Lean server until the statement compiles.
Prover generates a Lean proof and repairs failures in a loop of up to two tries; the Lean kernel verifies every accepted proof.
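
The control flow of these four stages can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: every function name and signature here (`discover`, `judge`, `formalize`, `prove`, `check_proof`, `run_pipeline`) is hypothetical, the agents are passed in as plain callables, and "up to two tries" is read as two proof attempts in total.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Result:
    statement: str
    proof: Optional[str]  # None: the statement compiled but no proof was found


def run_pipeline(
    seeds: Iterable[str],
    discover: Callable[[str], Iterable[str]],
    judge: Callable[[str], bool],
    formalize: Callable[[str], Optional[str]],
    prove: Callable[[str, Optional[str]], str],
    check_proof: Callable[[str, str], Tuple[bool, Optional[str]]],
    max_attempts: int = 2,  # assumption: "two tries" means two attempts in total
) -> List[Result]:
    results: List[Result] = []
    for seed in seeds:
        for candidate in discover(seed):        # Discovery: propose candidates
            if not judge(candidate):            # Judge: drop unsound candidates
                continue
            statement = formalize(candidate)    # Formalizer: make it compile
            if statement is None:
                continue
            proof, feedback = None, None
            for _ in range(max_attempts):       # Prover: attempt, then repair
                attempt = prove(statement, feedback)
                ok, feedback = check_proof(statement, attempt)
                if ok:
                    proof = attempt
                    break
            results.append(Result(statement, proof))
    return results


def demo(max_attempts: int = 2) -> List[Result]:
    # Toy stand-ins for the LLM agents and the Lean checker (all hypothetical).
    def prove(statement: str, feedback: Optional[str]) -> str:
        return "trivial" if feedback else "sorry"

    def check_proof(statement: str, proof: str) -> Tuple[bool, Optional[str]]:
        return (True, None) if proof == "trivial" else (False, "error: sorry used")

    return run_pipeline(
        seeds=["seed"],
        discover=lambda s: [f"{s}_ok", f"{s}_bad"],
        judge=lambda c: c.endswith("_ok"),
        formalize=lambda c: f"theorem {c} : True",
        prove=prove,
        check_proof=check_proof,
        max_attempts=max_attempts,
    )
```

In the toy `demo`, the first proof attempt fails and the checker's error message is fed back to the prover, so the second attempt succeeds. Keeping statements whose proof is `None` mirrors the split between the benchmark (compiled statements) and the verified library (statements with accepted proofs).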

Figure 1: Overview of MathlibLemma. A multi-agent pipeline where the Discovery Agent mines candidates from Mathlib seeds, followed by semantic filtering (Judge), syntactic repair (Formalizer), and proof generation (Prover), yielding a verified library and benchmark.

Experimental Results

Research Questions

RQ1: Can an automated pipeline reliably discover missing folklore lemmas from an existing library seed context?
RQ2: How well can LLMs filter, formalize, and prove folklore lemmas in Lean/Mathlib while minimizing hallucinations?
RQ3: What is the solvability and quality of a large folklore lemma benchmark under current provers?
RQ4: To what extent can generated folklore proofs be upstreamed into Mathlib and accepted by human mathematicians?

Key Findings

Model	Total	Foundational	Applied	Abstract
GPT	19.81	17.47	21.30	19.81
GPT-Reasoning	22.32	19.67	23.98	22.32
Kimina	14.37	12.12	15.28	14.37
Goedel	21.18	29.63	15.54	12.96
DeepSeek32B	7.05	11.11	3.60	3.89
DeepSeek70B	6.73	10.95	3.60	2.96
Qwen	2.81	2.05	2.99	3.89
Union (All Models)	44.99	50.86	37.14	50.86

A benchmark of 4,028 type-checked Lean statements spanning Foundational, Applied, and Abstract domains was built.
A rigorous audit showed 78% of sampled unsolved instances are provable by humans, indicating high intrinsic validity of statements.
State-of-the-art models collectively solved 44.99% of the benchmark (Success@2) with notable diversity benefits over any single model.
The specialist Goedel Prover achieves 29.63% in Foundational but drops to 12.96% in Abstract, highlighting the trade-off between generalist and specialist models.
Ensembling diverse models yields substantial gains, with Union performance exceeding the best individual model by a large margin (44.99% vs 22.32%).
1,812 proofs were generated and verified, with 3 lemmas upstreamed into Mathlib, demonstrating real-world utility.
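
The Union figure counts a task as solved if at least one model solves it, which is why it can exceed every single-model rate; with the reported totals the gap is 44.99 - 22.32 = 22.67 percentage points. The sketch below shows that computation on toy data. The function names and the per-model task sets are hypothetical and do not reproduce the paper's per-task results.

```python
from typing import Dict, Set


def solve_rate(solved: Set[int], total_tasks: int) -> float:
    """Percentage of tasks solved by one model."""
    return 100.0 * len(solved) / total_tasks


def union_rate(per_model: Dict[str, Set[int]], total_tasks: int) -> float:
    """Percentage of tasks solved by at least one model."""
    solved_by_any = set().union(*per_model.values())
    return 100.0 * len(solved_by_any) / total_tasks


# Toy data: 10 tasks, three models with partly overlapping solved sets.
toy = {
    "model_a": {0, 1, 2},
    "model_b": {2, 3},
    "model_c": {7},
}
best_single = max(solve_rate(s, 10) for s in toy.values())
combined = union_rate(toy, 10)
print(best_single, combined)
```

The less the solved sets overlap, the larger the gain from the union, which is the diversity effect the findings describe.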

Figure 2: MathlibLemma taxonomy and composition. The benchmark is partitioned into three distinct domains (inner ring): Foundational, Applied, and Abstract. The outer ring shows topic areas used to source seed contexts, with representative examples in parentheses.

This review was generated by AI and reviewed by human editors.