QUICK REVIEW

[Paper Review] Autoformalization with Large Language Models

Yuhuai Wu, Albert Q. Jiang|arXiv (Cornell University)|May 25, 2022

Mathematics, Computing, and Information Processing | Cited by 41

One-Sentence Summary

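Large language models can translate natural-language mathematics into Isabelle/HOL with notable success (38 of 150 evaluated translations, or 25.3%, are perfect). Training a neural theorem prover on the autoformalized theorems raises its proof rate on miniF2F to 35.2%, a new state of the art.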

ABSTRACT

Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs. A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show large language models provide new prospects towards this goal. We make the surprising observation that LLMs can correctly translate a significant portion ($25.3\%$) of mathematical competition problems perfectly to formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6\%$ to $35.2\%$.

Motivation and Objectives

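Show that large language models can automatically formalize natural-language mathematical statements as Isabelle/HOL code.
Assess the quality of the autoformalizations with human evaluation and with BLEU scores on datasets derived from miniF2F.
Show that autoformalized theorems can improve a neural theorem prover through expert iteration.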

Proposed Method

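Prompt PaLM and Codex with a few in-context examples to translate natural-language statements into Isabelle code.
Score the translations with BLEU against the human-written ground-truth formalizations in the algebra and number_theory subsets of miniF2F.
Manually analyze the errors in 150 autoformalizations to identify failure modes.
Apply an expert iteration loop: generate proofs with the base prover, add the successful proofs to the training data, and fine-tune to obtain an improved prover.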
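The expert iteration loop in the last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Prover` type, `expert_iteration`, `make_toy_prover`, and `toy_fine_tune` are hypothetical names, and the toy prover (which "proves" any statement no longer than its skill level) merely stands in for a neural prover so the loop can run end to end.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical type: a prover maps a formal statement to a proof, or None on failure.
Prover = Callable[[str], Optional[str]]


def expert_iteration(
    base_prover: Prover,
    statements: List[str],
    fine_tune: Callable[[Prover, List[Tuple[str, str]]], Prover],
    rounds: int = 2,
) -> Tuple[Prover, List[Tuple[str, str]]]:
    """Repeat: attempt proofs, keep the successes, fine-tune on them."""
    prover = base_prover
    proved: Dict[str, str] = {}
    for _ in range(rounds):
        for stmt in statements:
            if stmt in proved:
                continue  # already have a proof for this statement
            proof = prover(stmt)
            if proof is not None:
                proved[stmt] = proof  # successful proofs become training data
        prover = fine_tune(prover, list(proved.items()))
    return prover, list(proved.items())


def make_toy_prover(skill: int) -> Prover:
    """Toy stand-in (assumption): proves statements of length <= skill."""
    def prove(stmt: str) -> Optional[str]:
        return f"proof_of({stmt})" if len(stmt) <= skill else None
    prove.skill = skill  # stored so the toy fine-tuning step can read it
    return prove


def toy_fine_tune(prover: Prover, data: List[Tuple[str, str]]) -> Prover:
    """Toy stand-in (assumption): more training data means more skill."""
    return make_toy_prover(prover.skill + len(data))


# Stand-ins for autoformalized theorem statements of increasing difficulty.
toy_statements = ["a", "bb", "cccc", "dddddddd"]
final_prover, training_data = expert_iteration(
    make_toy_prover(2), toy_statements, toy_fine_tune, rounds=2
)
print(len(training_data), "statements proved after 2 rounds")
```

In the toy run, the second round proves a statement that the base prover could not, which is the effect expert iteration is meant to produce. A real setup would replace `make_toy_prover` with a proof search over a formal system and `toy_fine_tune` with actual model training.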

Experimental Results

Research Questions

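RQ1: Can large language models translate natural-language mathematical statements into Isabelle/HOL with high fidelity?
RQ2: How do model scale and the choice of model (PaLM variants, Codex) affect autoformalization quality?
RQ3: Can autoformalized theorems improve a neural theorem prover on a standard benchmark such as miniF2F?
RQ4: What are the common failure modes in autoformalization, and how can prompts or examples mitigate them?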

Key Findings

Model	Valid	Test
PACT	23.9%	24.6%
FMSCL	33.6%	29.6%
Base model (M0)	28.3%	29.9%
After 1 expert iteration (M1)	36.1%	34.0%
After 2 expert iterations (M2)	37.3%	35.2%
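The gains implied by the table can be checked with a few lines of arithmetic. The sketch below copies the proof rates from the table and the 38-of-150 count from the summary; the dictionary layout and the helper name `gain` are illustrative choices, not anything defined in the paper.

```python
# miniF2F proof rates in percent, copied from the table: (valid, test).
rates = {
    "PACT": (23.9, 24.6),
    "FMSCL": (33.6, 29.6),
    "M0": (28.3, 29.9),
    "M1": (36.1, 34.0),
    "M2": (37.3, 35.2),
}


def gain(after: str, before: str, split: str) -> float:
    """Percentage-point difference between two models on one split."""
    idx = {"valid": 0, "test": 1}[split]
    return round(rates[after][idx] - rates[before][idx], 1)


# Improvement over the previous state of the art (FMSCL) on the test set.
print("M2 vs FMSCL (test):", gain("M2", "FMSCL", "test"))
# Improvement of each expert iteration over the base model on the test set.
print("M1 vs M0 (test):", gain("M1", "M0", "test"))
print("M2 vs M0 (test):", gain("M2", "M0", "test"))
# Share of perfect autoformalizations among the 150 manually evaluated ones.
print("Perfect rate (%):", round(38 / 150 * 100, 1))
```

The first printed figure corresponds to the improvement from 29.6% to 35.2% quoted in the abstract, and the last to the 25.3% perfect-translation rate.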

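Codex and the large PaLM models produce perfect Isabelle translations in some cases (for example, Case Study 1); overall, 38 of the 150 manually evaluated autoformalizations (25.3%) are perfect.
BLEU scores improve with model scale: PaLM 8B (algebra 31.49, number_theory 22.10), PaLM 64B (algebra 43.13, number_theory 31.43), PaLM 540B (algebra 50.30, number_theory 36.16), Codex (algebra 57.13, number_theory 43.33).
Training the neural theorem prover on autoformalized theorems through expert iteration reaches the state of the art on miniF2F: the test-set proof rate is 29.9% for the base model, 34.0% after one iteration, and 35.2% after two iterations.
Two rounds of expert iteration with autoformalized data improve on the previous state of the art by 5.6 percentage points (from 29.6% to 35.2% on the test set).
Case studies show both perfect translations and failures (for example, informal definitions that do not line up with Isabelle concepts), and illustrate the effect of few-shot prompting.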
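The few-shot prompting used for autoformalization can be illustrated with a small prompt builder. This is a sketch under assumptions: the template wording, the function name `build_prompt`, and the two example pairs are all hypothetical and are not taken from the paper's actual prompt. No model is called here; the returned string is what would be sent to a model such as Codex or PaLM as a completion prompt.

```python
from typing import List, Tuple

# Illustrative (natural language, Isabelle) pairs; assumptions, not the paper's prompt.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    (
        "Show that for any real number x, x^2 is nonnegative.",
        'theorem\n  fixes x :: real\n  shows "0 <= x^2"',
    ),
    (
        "Prove that the sum of two even natural numbers is even.",
        'theorem\n  fixes a b :: nat\n'
        '  assumes "even a" and "even b"\n'
        '  shows "even (a + b)"',
    ),
]


def build_prompt(examples: List[Tuple[str, str]], problem: str) -> str:
    """Concatenate solved examples, then leave the final formalization open."""
    parts = []
    for informal, formal in examples:
        parts.append(
            f"Natural language version: {informal}\n"
            f"Isabelle version:\n{formal}\n"
        )
    # The model is expected to continue the text after this last header.
    parts.append(f"Natural language version: {problem}\nIsabelle version:\n")
    return "\n".join(parts)


prompt = build_prompt(
    FEW_SHOT_EXAMPLES,
    "Show that the product of two odd integers is odd.",
)
print(prompt)
```

Swapping in examples that are closer to the target problem changes only the `examples` argument, which makes it easy to probe how the choice of in-context examples affects the output.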


This review was generated by AI and reviewed by a human editor.