QUICK REVIEW

[Paper Review] ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics

Zhangir Azerbayev, Bartosz Piotrowski | arXiv (Cornell University) | Feb 24, 2023

Mathematics, Computing, and Information Processing | Citations: 14

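One-sentence summary

ProofNet is a benchmark of 371 examples, each pairing a formal theorem statement in Lean 3 with a natural language statement and a natural language proof; the paper reports baseline results and introduces two new autoformalization methods, prompt retrieval and distilled backtranslation.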

ABSTRACT

We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmark consists of 371 examples, each consisting of a formal theorem statement in Lean 3, a natural language theorem statement, and a natural language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving. We report baseline results on statement autoformalization via in-context learning. Moreover, we introduce two novel statement autoformalization methods: prompt retrieval and distilled backtranslation.

Motivation and objectives

Provide a parallel dataset that pairs Lean 3 formal theorem statements with natural language statements and proofs of undergraduate mathematics, to drive autoformalization and theorem proving research.
Evaluate existing language models on autoformalization and informalization tasks within ProofNet.
Propose and assess techniques to boost autoformalization performance without large parallel corpora.
Demonstrate open-source models trained on mathematical data and analyze their strengths and limitations.

Proposed methods

Construct ProofNet with 371 parallel formal statements in Lean 3, corresponding natural language statements, and natural language proofs.
Evaluate in-context learning baselines for autoformalization using large language models.
Introduce prompt retrieval to augment few-shot prompts with relevant Lean mathlib statements.
Develop distilled backtranslation to fine-tune models for autoformalization without parallel data.
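
The prompt retrieval idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary only says that few-shot prompts are augmented with relevant Lean mathlib statements, so the relevance score used here (bag-of-words cosine similarity) is a stand-in chosen for simplicity, and the pool entries, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k pool statements most similar to the query."""
    q = tokenize(query)
    return sorted(pool, key=lambda s: cosine(q, tokenize(s)), reverse=True)[:k]


def build_prompt(informal: str, pool: list[str], k: int = 2) -> str:
    """Prepend retrieved formal statements to the autoformalization request."""
    examples = "\n\n".join(retrieve(informal, pool, k))
    return (
        "Relevant Lean mathlib statements:\n"
        f"{examples}\n\n"
        f"Natural language: {informal}\n"
        "Lean 3 statement:"
    )


# Hypothetical, hand-written pool; illustrative strings, not checked against mathlib.
POOL = [
    "theorem subgroup_of_abelian_is_normal (G : Type*) [comm_group G] (H : subgroup G) : H.normal",
    "theorem abs_add_le_example (a b : ℝ) : |a + b| ≤ |a| + |b|",
    "theorem prime_divides_product (p a b : ℕ) (hp : nat.prime p) (h : p ∣ a * b) : p ∣ a ∨ p ∣ b",
]

prompt = build_prompt("Every subgroup of an abelian group is normal.", POOL, k=1)
```

The design point is that the retrieved statements show the model the library's naming and notation for the topic at hand, instead of a fixed set of few-shot examples that may be unrelated to the query.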

Experimental results

Research questions

RQ1: How well can large language models autoformalize informal theorem statements into Lean 3 formalizations?
RQ2: Do retrieval-augmented prompts and distilled backtranslation improve autoformalization performance over few-shot baselines?
RQ3: What are the strengths and failure modes of current models on formalization and informalization tasks in ProofNet?
RQ4: Can open-source math-focused models trained on a math-rich corpus compete with black-box API baselines for autoformalization tasks?

Key findings

Method	Model	Formalization typecheck rate	Formalization BLEU	Formalization accuracy	Informalization compile rate	Informalization BLEU	Informalization accuracy
Few-shot	proofGPT-1.3B	5.9	8.1	0	0.77	5.1	4.3
Few-shot	proofGPT-6.7B	4.3	4.7	0	0.70	6.0	6.5
Few-shot	Codex	23.7	25.1	13.4	100	13.2	62.3
Prompt retrieval	Codex	45.2	14.8	16.1	-	-	-
Distilled backtranslation	proofGPT-1.3B	19.4	10.7	3.2	-	-	-
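
The formalization columns of the table can be read as simple proportions. The sketch below assumes that both rates are percentages over all benchmark examples and that an output counts toward accuracy only if it typechecks and is also judged a faithful formalization. That definition is an assumption made for illustration, and the `Judgment` record and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Outcome for one model output (hypothetical record, not from the paper)."""
    typechecks: bool  # the Lean 3 statement was accepted by the type checker
    faithful: bool    # a reviewer judged it to match the informal statement


def typecheck_rate(judgments: list[Judgment]) -> float:
    """Percentage of outputs that typecheck."""
    return 100.0 * sum(j.typechecks for j in judgments) / len(judgments)


def accuracy(judgments: list[Judgment]) -> float:
    """Percentage of outputs that typecheck and are judged faithful (assumed definition)."""
    return 100.0 * sum(j.typechecks and j.faithful for j in judgments) / len(judgments)


# Toy data: 8 outputs, 3 typecheck, 1 of those is judged faithful.
sample = (
    [Judgment(True, True)]
    + [Judgment(True, False)] * 2
    + [Judgment(False, False)] * 5
)
rates = (typecheck_rate(sample), accuracy(sample))
```

Under this reading accuracy can never exceed the typecheck rate, which is consistent with every formalization row in the table.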

In-context learning baselines achieve nontrivial formalization performance but are far from perfect: Code-davinci-002 (the Codex rows of the table) reaches 13.4% formalization accuracy in the few-shot setting, and both proofGPT models score 0.
Prompt retrieval improves on standard few-shot prompting with Codex, nearly doubling the typecheck rate (23.7 to 45.2) and raising formalization accuracy from 13.4 to 16.1.
Distilled backtranslation lifts a smaller model above its in-context learning baseline: proofGPT-1.3B goes from a 5.9 typecheck rate and 0 accuracy to 19.4 and 3.2.
Informalization is easier than formalization for every model evaluated, with informalization accuracy above formalization accuracy in each few-shot row (62.3 versus 13.4 for Codex).
BLEU correlates poorly with formalization performance, while typecheck rate is a better predictor of autoformalization success; prompt retrieval, for example, lowers Codex's BLEU from 25.1 to 14.8 even as its accuracy rises.
Code-davinci-002 shows a strong semantic grasp when it produces typecheckable formalizations, though a typechecking output is not necessarily a correct one (23.7 typecheck rate versus 13.4 accuracy), so outputs still require careful verification.
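
The data flow behind distilled backtranslation can be sketched as follows. This is one hedged reading of "fine-tuning for autoformalization without parallel data": a teacher model informalizes formal statements that have no natural language counterpart, and the resulting synthetic pairs become the training set for a smaller student model. The teacher below is a toy stub, the statements are illustrative strings, and every name is hypothetical; the fine-tuning step itself is omitted.

```python
from typing import Callable


def build_synthetic_pairs(
    formal_statements: list[str],
    informalize: Callable[[str], str],
) -> list[dict[str, str]]:
    """Turn formal-only data into (informal -> formal) training pairs."""
    pairs = []
    for formal in formal_statements:
        informal = informalize(formal).strip()
        if informal:  # skip statements the teacher could not describe
            pairs.append({"input": informal, "target": formal})
    return pairs


def toy_teacher(formal: str) -> str:
    """Hypothetical stand-in for a teacher model; a real one would be an LLM call."""
    name = formal.split()[1]  # declaration name follows the `theorem` keyword
    return "Statement about " + name.replace("_", " ") + "."


# Illustrative Lean-style strings, not checked against mathlib.
FORMAL_ONLY = [
    "theorem two_dvd_add (a b : ℤ) (ha : 2 ∣ a) (hb : 2 ∣ b) : 2 ∣ a + b",
    "theorem sq_nonneg_example (x : ℝ) : 0 ≤ x ^ 2",
]

pairs = build_synthetic_pairs(FORMAL_ONLY, toy_teacher)
# `pairs` would then serve as the fine-tuning set for the smaller student model.
```

The direction of the pairs matters: the teacher works in the easier formal-to-informal direction, while the student is trained on the harder informal-to-formal direction.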


This review was generated by AI and checked by a human editor.