QUICK REVIEW

[Paper Review] VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Yutong Xin, Qiaochu Chen|arXiv (Cornell University)|Feb 20, 2026
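Logic, programming, and type systems | Cited by 0
One-Sentence Summary

VeriSoftBench is a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods projects, packaged with their repository context so that provers are evaluated against project-specific, multi-file dependencies. The results show limited transfer from mathematics-centric benchmarks and a strong effect of transitive dependencies.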

ABSTRACT

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.

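Motivation and Goals

  • Assess how well provers handle repository-scale Lean proofs that involve project-specific abstractions and cross-file dependencies.
  • Measure how curated context versus full-repository context affects proof success.
  • Characterize how transitive repository dependencies affect proof-automation performance.
  • Provide a benchmark and evaluation suite to drive progress in repository-scale formal verification.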

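Proposed Method

  • Build VeriSoftBench from 23 Lean repositories spanning verification domains.
  • Preserve repository context, including cross-file dependencies and project-specific abstractions.
  • Define two context regimes: curated (restricted to the proof's dependencies) and full repository (the entire local repository).
  • Evaluate frontier LLMs with a generate-check-repair loop, and run provers such as Aristotle end to end.
  • Analyze how proof success correlates with repository dependency structure (direct versus transitive dependencies, and depth).
  • Release the benchmark and results on GitHub for community use.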
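
The generate-check-repair loop named in the method list can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the `generate` and `check` callables, their signatures, and the stub implementations are all hypothetical, and reading "Pass@8, r=3" as eight independent samples with up to three repair rounds each is an assumption.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (not taken from the paper):
#   generate(theorem, context, feedback) -> candidate proof text
#   check(theorem, proof) -> (accepted, error message from the checker)
Generator = Callable[[str, str, Optional[str]], str]
Checker = Callable[[str, str], Tuple[bool, str]]


def attempt(theorem: str, context: str, generate: Generator,
            check: Checker, repair_rounds: int = 3) -> Optional[str]:
    """One sample: an initial generation plus up to `repair_rounds` repairs."""
    feedback: Optional[str] = None
    for _ in range(1 + repair_rounds):
        proof = generate(theorem, context, feedback)
        accepted, error = check(theorem, proof)
        if accepted:
            return proof
        feedback = error  # feed the checker error into the next generation
    return None


def solved_at_k(theorem: str, context: str, generate: Generator,
                check: Checker, k: int = 8, repair_rounds: int = 3) -> bool:
    """A task counts as solved if any of k independent samples succeeds."""
    return any(
        attempt(theorem, context, generate, check, repair_rounds) is not None
        for _ in range(k)
    )


# Toy stand-ins so the sketch runs without a model or a Lean toolchain.
def stub_generate(theorem: str, context: str, feedback: Optional[str]) -> str:
    # Pretend the model only gets it right after seeing an error message.
    return "sorry" if feedback is None else "rfl"


def stub_check(theorem: str, proof: str) -> Tuple[bool, str]:
    accepted = proof == "rfl"
    return accepted, "" if accepted else "error: proof contains sorry"


if __name__ == "__main__":
    print(attempt("example_thm", "curated context", stub_generate, stub_check))
```

In a real setup `check` would invoke the Lean compiler inside the task's repository, and `context` would be either the curated or the full-repository context; both details are outside the scope of this sketch.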
Figure 1 : Contextual dependencies comparison between (a) mathematical benchmark proofs, (b) lightweight verification tasks, and (c) repository-scale verification. PutnamBench proofs rely almost entirely on library (Mathlib) dependencies (purple). Verina introduces a small number of project-specific

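Experimental Results

Research Questions

  • RQ1: How well do Mathlib-centric provers transfer to repository-scale verification tasks?
  • RQ2: How do transitive repository dependencies relate to proof success?
  • RQ3: Does curated local context improve prover performance compared with full-repository context?
  • RQ4: Which common patterns and dependency structures affect proof automation in large codebases?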

Key Findings

| Category | Model | Curated Context (Pass@8) | Full Context (Pass@8) |
| VeriSoftBench-Full | Claude Opus 4.5 (Pass@8, r=3) | 31.2% | 23.2% |
| VeriSoftBench-Full | GPT-5.2 (Pass@8, r=3) | 12.6% | 10.8% |
| VeriSoftBench-Full | Gemini-3-Pro (Pass@8, r=3) | 41.0% | 34.8% |
| VeriSoftBench-Full | Gödel-Prover-v2 (Pass@8, r=3) | 5.6% | 0.0% |
| VeriSoftBench-Aristotle | Aristotle | - | 69% |
| VeriSoftBench-Aristotle | Gemini-3-Pro (r=3) | - | 65% |
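
To make the curated-versus-full comparison concrete, the short sketch below recomputes the gap for each VeriSoftBench-Full row. The percentages are copied from the table; the helper names are illustrative, and the relative gain is left undefined when the full-context score is zero.

```python
from typing import Optional

# Pass@8 success rates in percent, copied from the table: (curated, full).
results = {
    "Claude Opus 4.5": (31.2, 23.2),
    "GPT-5.2": (12.6, 10.8),
    "Gemini-3-Pro": (41.0, 34.8),
    "Gödel-Prover-v2": (5.6, 0.0),
}


def gap_points(curated: float, full: float) -> float:
    """Absolute gain of curated over full context, in percentage points."""
    return round(curated - full, 1)


def relative_gain(curated: float, full: float) -> Optional[float]:
    """Gain relative to the full-context score; None when that score is 0."""
    if full == 0:
        return None
    return (curated - full) / full


if __name__ == "__main__":
    for model, (curated, full) in results.items():
        rel = relative_gain(curated, full)
        rel_text = "n/a" if rel is None else f"{rel:.1%}"
        print(f"{model}: +{gap_points(curated, full)} points, relative {rel_text}")
```

Every model in the table gains from curated context, with absolute gaps between 1.8 and 8.0 percentage points.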
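  • Frontier LLMs and specialized provers achieve only moderate success on VeriSoftBench tasks.
  • Performance drops when completing a goal requires transitive, multi-hop, repository-local dependencies.
  • Curated context improves performance over full-repository context, but substantial room for improvement remains.
  • Mathlib-centric benchmarks are poor predictors of performance on repository-scale verification tasks.
  • Full context may still aid inference by exposing cues about recurring structural patterns, beyond the dependencies a proof directly needs.
  • Aristotle reaches 69% on the subset that includes same-file lemmas, and Gemini-3-Pro reaches 65% on the same subset, which suggests that this subset is easier.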
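
The notions of a transitive dependency closure and multi-hop depth used in these findings can be illustrated with a breadth-first traversal over a declaration dependency graph. The sketch below is an assumption-laden toy: the graph, the declaration names, and the `dependency_hops` helper are hypothetical, and the benchmark's actual dependency extraction is not described here.

```python
from collections import deque
from typing import Dict, Iterable


def dependency_hops(target: str,
                    direct_deps: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Map every declaration reachable from `target` to its minimum hop count.

    `direct_deps[name]` lists the declarations that `name` refers to directly.
    The keys of the result form the transitive dependency closure of `target`.
    """
    hops: Dict[str, int] = {}
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        name, depth = queue.popleft()
        for dep in direct_deps.get(name, ()):
            if dep not in seen:
                seen.add(dep)
                hops[dep] = depth + 1
                queue.append((dep, depth + 1))
    return hops


# Toy graph with made-up declaration names.
toy_graph = {
    "target_thm": ["lemma_a", "def_b"],
    "lemma_a": ["def_c"],
    "def_b": ["def_d"],
    "def_c": ["def_d"],
}

toy_hops = dependency_hops("target_thm", toy_graph)
closure_size = len(toy_hops)               # number of transitive dependencies
max_depth = max(toy_hops.values(), default=0)  # longest shortest-path hop count

if __name__ == "__main__":
    print(sorted(toy_hops.items()), closure_size, max_depth)
```

Under this reading, a curated context would contain only the declarations in `toy_hops`, while the closure size and maximum depth are the kind of quantities one could correlate with proof success.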
Figure 3 : An example task instance from our benchmark. The goal is to synthesize a proof for the target theorem cexec_to_reds , which relates two definitions of program execution in a formalized programming language. The figure illustrates the context that must be provided to or retrieved by the pr


This review was generated by AI and checked by a human editor.