QUICK REVIEW

[Paper Review] Autoformalization with Large Language Models

Yuhuai Wu, Albert Q. Jiang|arXiv (Cornell University)|May 25, 2022

Mathematics, Computing, and Information Processing|Citations: 41

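One-line summary

Large language models translate natural-language mathematics into Isabelle/HOL with notable success (38 of 150 evaluated problems, or 25.3%, are formalized perfectly). Training a neural theorem prover on the autoformalized theorems raises its MiniF2F proof rate to 35.2%, a new state of the art.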

ABSTRACT

Autoformalization is the process of automatically translating from natural language mathematics to formal specifications and proofs. A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence. While the long-term goal of autoformalization seemed elusive for a long time, we show large language models provide new prospects towards this goal. We make the surprising observation that LLMs can correctly translate a significant portion ($25.3\%$) of mathematical competition problems perfectly to formal specifications in Isabelle/HOL. We demonstrate the usefulness of this process by improving a previously introduced neural theorem prover via training on these autoformalized theorems. Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6\%$ to $35.2\%$.

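Motivation and objectives

Demonstrate that LLMs can autoformalize natural-language mathematical statements into formal Isabelle/HOL code.
Evaluate autoformalization quality on datasets derived from miniF2F, using human evaluation and BLEU scores.
Show that autoformalized theorems can improve a neural theorem prover through expert iteration.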

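Proposed method

Use in-context learning with few-shot examples to have PaLM and Codex translate natural-language statements into Isabelle code.
Evaluate the translations on the miniF2F-algebra and miniF2F-number_theory subsets by computing BLEU against human-written ground-truth formalizations.
Conduct a human error analysis of 150 autoformalizations to identify failure modes.
Apply an expert iteration loop: generate proofs with a base prover, add the successful proofs to the training data, and fine-tune to obtain an improved prover.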
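
The expert iteration loop in the last step can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the integer "difficulty" attached to each statement, the `train_prover` and `expert_iteration` helpers, and the rule that skill grows with the amount of training data. In the actual pipeline the prover is a neural model and proofs are checked by Isabelle; this sketch only shows the shape of the loop.

```python
from typing import Callable, List, Optional, Tuple

# Toy stand-ins (assumptions, not the paper's setup): a statement carries an
# invented integer "difficulty", and a prover's "skill" grows with its data.
Statement = Tuple[str, int]
Prover = Callable[[Statement], Optional[str]]


def train_prover(proofs: List[Tuple[str, str]]) -> Prover:
    """Toy 'fine-tuning': skill is 1 plus the number of training proofs."""
    skill = 1 + len(proofs)

    def prove(stmt: Statement) -> Optional[str]:
        name, difficulty = stmt
        # A real system would search for a proof and check it in Isabelle.
        return f"proof of {name}" if difficulty <= skill else None

    return prove


def expert_iteration(statements: List[Statement], rounds: int):
    data: List[Tuple[str, str]] = []
    prover = train_prover(data)          # base prover, M0
    solved_per_round: List[int] = []
    for _ in range(rounds):
        solved = {name for name, _ in data}
        for stmt in statements:
            if stmt[0] in solved:
                continue
            proof = prover(stmt)         # prover is fixed within a round
            if proof is not None:
                data.append((stmt[0], proof))
        prover = train_prover(data)      # retrain on the enlarged data set
        solved_per_round.append(len(data))
    return prover, solved_per_round


statements = [("thm_a", 1), ("thm_b", 2), ("thm_c", 3), ("thm_d", 10)]
final_prover, solved_per_round = expert_iteration(statements, rounds=2)
print(solved_per_round)
```

Each round, the current prover attempts the statements it has not yet solved, and the successful proofs enlarge the training set for the next prover. Statements that stay out of reach (here `thm_d`) simply remain unsolved.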

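Experimental results

Research questions

RQ1: Can large language models translate natural-language mathematical statements into Isabelle/HOL with high fidelity?
RQ2: How do model scale and model choice (PaLM variants, Codex) affect autoformalization quality?
RQ3: Can autoformalized theorems improve a neural theorem prover on a standard benchmark such as miniF2F?
RQ4: What are the common failure modes of autoformalization, and can prompting or examples mitigate them?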

Key findings

Model	Valid	Test
PACT	23.9%	24.6%
FMSCL	33.6%	29.6%
Base model (M0)	28.3%	29.9%
After 1 expert iteration (M1)	36.1%	34.0%
After 2 expert iterations (M2)	37.3%	35.2%
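
As a quick arithmetic check on the table, the snippet below recomputes the percentage-point gains between rows. The numbers are copied from the table and from the 38/150 figure in the summary; the `gain` helper and the dictionary layout are illustrative choices, not part of the paper.

```python
# Success rates in percent, copied from the table: (valid, test).
results = {
    "PACT": (23.9, 24.6),
    "FMSCL": (33.6, 29.6),
    "Base model (M0)": (28.3, 29.9),
    "After 1 expert iteration (M1)": (36.1, 34.0),
    "After 2 expert iterations (M2)": (37.3, 35.2),
}


def gain(before: str, after: str, split: str) -> float:
    """Percentage-point difference between two rows on one split."""
    idx = {"valid": 0, "test": 1}[split]
    return round(results[after][idx] - results[before][idx], 1)


m2 = "After 2 expert iterations (M2)"
print("M2 vs FMSCL (test):", gain("FMSCL", m2, "test"))
print("M2 vs M0 (test):", gain("Base model (M0)", m2, "test"))
print("M2 vs M0 (valid):", gain("Base model (M0)", m2, "valid"))

# Share of perfect autoformalizations: 38 of 150 evaluated problems.
perfect_rate = round(100 * 38 / 150, 1)
print("Perfect translations (%):", perfect_rate)
```

The gain over FMSCL on the test split is the 5.6-point improvement quoted for the new state of the art, while the gain over the base model M0 on the same split is slightly smaller because M0 already starts at 29.9%.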

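Codex and the large PaLM models can produce perfect Isabelle translations for some cases (e.g., Case Study 1); overall, 25.3% of the 150 evaluated autoformalizations are perfect.
BLEU scores improve with PaLM model scale, and Codex scores highest: PaLM 8B (algebra 31.49, number_theory 22.10), PaLM 64B (algebra 43.13, number_theory 31.43), PaLM 540B (algebra 50.30, number_theory 36.16), Codex (algebra 57.13, number_theory 43.33).
Training a neural theorem prover with expert iteration on autoformalized theorems achieves a new state of the art on miniF2F: the base model reaches 29.9% on test, rising to 34.0% after one iteration and 35.2% after two.
Two rounds of expert iteration with autoformalized data yield a 5.6-point improvement over the previous state of the art (29.6% to 35.2% on test).
The case studies show perfect translations alongside several failures (e.g., mismatches between informal definitions and Isabelle concepts) and illustrate the effect of few-shot prompting.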
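
To make the few-shot prompting setup concrete, here is a minimal sketch of how such a prompt might be assembled. The prompt wording, the `build_prompt` helper, and the two example pairs are assumptions for illustration; they are not the paper's actual prompt, and the Isabelle snippets have not been checked against Isabelle.

```python
from typing import List, Tuple

# Illustrative (informal, formal) pairs; hypothetical, not from the paper.
EXAMPLES: List[Tuple[str, str]] = [
    (
        "Show that for any natural number n, n + 0 = n.",
        'theorem\n  fixes n :: nat\n  shows "n + 0 = n"',
    ),
    (
        "Show that for any real numbers a and b, a + b = b + a.",
        'theorem\n  fixes a b :: real\n  shows "a + b = b + a"',
    ),
]


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate solved examples, then leave the last answer open."""
    parts = []
    for informal, formal in examples:
        parts.append(
            f"Natural language version: {informal}\n"
            f"Isabelle version:\n{formal}\n"
        )
    # The model is expected to continue the text after the final header.
    parts.append(f"Natural language version: {query}\nIsabelle version:\n")
    return "\n".join(parts)


query = "Show that for any integer x, x * 1 = x."
prompt = build_prompt(EXAMPLES, query)
print(prompt)
```

The resulting string would be sent to a language model, and its continuation taken as the candidate formalization. Which examples are placed in the prompt is one of the choices the case studies suggest can affect translation quality.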

This review was generated by AI and checked by a human editor.