QUICK REVIEW

[Paper Review] Learning to Disprove: Formal Counterexample Generation with Large Language Models

Zenan Li, Zhaoyu Li|arXiv (Cornell University)|Mar 19, 2026
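Mathematics, Computing, and Information Processing|Cited by 0
One-Sentence Summary

The paper trains large language models to generate formal counterexamples together with proofs verified in Lean 4, using symbolic-mutation data synthesis and a multi-reward expert iteration framework, and reports significant gains over baselines.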

ABSTRACT

Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI efforts in mathematics focus almost exclusively on proof construction, often neglecting the equally important task of finding counterexamples. In this paper, we address this gap by fine-tuning large language models (LLMs) to reason about and generate counterexamples. We formalize this task as formal counterexample generation, which requires LLMs not only to propose candidate counterexamples but also to produce formal proofs that can be automatically verified in the Lean 4 theorem prover. To enable effective learning, we introduce a symbolic mutation strategy that synthesizes diverse training data by systematically extracting theorems and discarding selected hypotheses, thereby producing diverse counterexample instances. Together with curated datasets, this strategy enables a multi-reward expert iteration framework that substantially enhances both the effectiveness and efficiency of training LLMs for counterexample generation and theorem proving. Experiments on three newly collected benchmarks validate the advantages of our approach, showing that the mutation strategy and training framework yield significant performance gains.

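Motivation and Objectives

  • Motivate the need for counterexample generation in mathematical reasoning and formal verification.
  • Develop a data synthesis pipeline that creates diverse counterexample problems.
  • Propose a multi-reward training mechanism that improves both counterexample proposal and formal proof generation.
  • Fine-tune large language models for counterexample search and automated formal verification on Lean 4 benchmarks.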

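Proposed Method

  • Frames formal counterexample generation as a two-stage task: propose a counterexample informally, then verify it with a formal proof in Lean 4.
  • Symbolic mutation (a Lean 4 tactic, mutate) drops hypotheses to create mutated theorems, which yield counterexample problems.
  • Multi-reward expert iteration: trains two large language models (one for counterexamples, one for proofs), with a dual reward derived from proving the mutated theorem and proving the dropped hypothesis.
  • Weighted supervised fine-tuning with reward r_i = alpha * I(proof for the mutated theorem) + (1 - alpha) * I(proof for the dropped hypothesis).
  • Large-scale data synthesis from several seed sources (Mathlib, Leanworkbook, MiniF2F, PutnamBench), producing about 575K counterexample problems.
  • Evaluation on three benchmarks, including the For-Counter and Veri-Formalize tasks, shows gains in both pass rate and the absolute number of problems solved.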
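The reward formula above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the function names, the default alpha of 0.5, and the normalization of the weighted loss by the total reward are all hypothetical choices, not details confirmed by the paper.

```python
from typing import Sequence


def reward(proved_mutated: bool, proved_dropped: bool, alpha: float = 0.5) -> float:
    """r_i = alpha * I(proof for mutated theorem) + (1 - alpha) * I(proof for dropped hypothesis).

    The default alpha is a placeholder; the source does not state its value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * float(proved_mutated) + (1.0 - alpha) * float(proved_dropped)


def weighted_sft_loss(nlls: Sequence[float], rewards: Sequence[float]) -> float:
    """Reward-weighted average of per-sample negative log-likelihoods.

    Dividing by the total reward is an assumption made for this sketch.
    """
    if len(nlls) != len(rewards):
        raise ValueError("nlls and rewards must have the same length")
    total = sum(rewards)
    if total == 0:
        return 0.0  # no sample earned any reward, so nothing contributes
    return sum(r * l for r, l in zip(rewards, nlls)) / total
```

Under this sketch, a sample that earns only one of the two rewards still contributes to fine-tuning, just with a smaller weight than a sample that earns both.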
Figure 1: Framework of counterexample training. In the data synthesis stage, the symbolic mutation drops the hypothesis of a provable theorem, creating new counterexample problems. In the subsequent expert stage, two rewards are introduced based on whether the generated counterexample can prove the
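The hypothesis-dropping idea in the data synthesis stage can be sketched at the string level as follows. The paper performs mutation inside Lean 4; the `Theorem` class and `drop_one_hypothesis` function here are hypothetical stand-ins, and this toy does not check that a mutated statement is actually false.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Theorem:
    """Toy representation: hypotheses and conclusion are opaque strings."""
    name: str
    hypotheses: Tuple[str, ...]
    conclusion: str


def drop_one_hypothesis(thm: Theorem) -> List[Tuple[Theorem, str]]:
    """Return every variant of `thm` with exactly one hypothesis removed,
    paired with the hypothesis that was dropped."""
    variants = []
    for i, dropped in enumerate(thm.hypotheses):
        kept = thm.hypotheses[:i] + thm.hypotheses[i + 1:]
        mutated = Theorem(f"{thm.name}_drop{i}", kept, thm.conclusion)
        variants.append((mutated, dropped))
    return variants


seed = Theorem("sq_gt_self", ("n >= 2",), "n * n > n")
candidates = drop_one_hypothesis(seed)
```

Each candidate is only a potential counterexample problem: a dropped hypothesis may have been redundant, in which case the mutated statement remains provable.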

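Experimental Results

Research Questions

  • RQ1: Effectiveness and efficiency of data mutation in generating counterexample problems.
  • RQ2: Effectiveness and efficiency of multi-reward training compared with single-reward training.
  • RQ3: Overall performance gains of the integrated framework on counterexample generation and formal verification tasks.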

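Key Findings

  • Mutation-based data synthesis produces about 575K counterexample problems, with a mutation rate between 1.65 and 2.48 and an average time of 0.3–0.71 seconds per seed theorem.
  • Multi-reward training converges faster and reaches higher final pass@k than single-reward training (pass@1: about 49% vs. about 43%; pass@4: about 52% vs. about 46%; pass@9: about 54% vs. about 47%).
  • The fine-tuned model significantly outperforms state-of-the-art reasoning models on counterexample generation across the three benchmarks (better at pass@1, pass@4, and pass@9, and solving 95, 69, and 63 more problems than the strongest baseline at pass@1 on the three benchmarks, respectively).
  • The integrated workflow achieves superior performance on 1) counterexample identification, 2) verification of autoformalization results, and 3) verification of reasoning steps; the authors report significant improvements over proprietary and open-source provers.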
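For readers unfamiliar with the pass@k metric cited above, the sketch below shows one commonly used way to estimate it from n sampled attempts of which c are verified correct. This summary does not say how the paper computes pass@k, so treating it as this unbiased estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds,
    given n sampled attempts of which c passed verification."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    # One minus the chance that a random size-k subset contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score would then be the mean of this value over all problems.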
Figure 2: Task of formal counterexample generation. This task requires the LLM first to perform informal reasoning to identify a valid counterexample for the given problem, and then generate the corresponding formal proof, which is automatically verified by theorem provers (e.g., Lean 4).
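A tiny Lean 4 illustration of the task follows. The statement is a hypothetical toy example, not taken from the paper or its benchmarks: the true claim "for n ≥ 2, n * n > n" becomes false once the hypothesis n ≥ 2 is dropped, and n = 0 serves as the counterexample.

```lean
-- Mutated (false) claim: ∀ n : Nat, n * n > n.
-- Informal reasoning: try n = 0, since 0 * 0 = 0 is not greater than 0.

-- Formal disproof, checkable by Lean 4 without extra libraries.
theorem mutated_claim_is_false : ¬ (∀ n : Nat, n * n > n) := by
  intro h
  -- Instantiate the claim at the witness and refute it by computation.
  exact absurd (h 0) (by decide)

-- The same witness violates the dropped hypothesis `n ≥ 2`.
example : ¬ ((0 : Nat) ≥ 2) := by decide
```

The first proof shows that the mutated statement fails; the second shows that the witness lies outside the scope of the original theorem, which is why the original remains true.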

This review was generated by AI and reviewed by a human editor.