QUICK REVIEW

[Paper Review] VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Yutong Xin, Qiaochu Chen|arXiv (Cornell University)|Feb 20, 2026

Logic, programming, and type systems | Cited by 0

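One-Sentence Summary

VeriSoftBench is a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods projects, packaged with their repository context so that provers are evaluated against project-specific, multi-file dependencies. The results show limited transfer from mathematics-centric benchmarks and a strong effect of transitive dependencies.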

ABSTRACT

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.

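Research Motivation and Goals

Evaluate how well provers handle repository-scale Lean proofs that involve project-specific abstractions and cross-file dependencies.
Evaluate how curated context versus full-repository context affects proof success.
Characterize how transitive repository dependencies affect proof-automation performance.
Provide a benchmark and evaluation suite to drive improvements in repository-scale formal verification.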

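Proposed Method

Build VeriSoftBench from 23 Lean repositories spanning verification domains.
Preserve repository context, including cross-file dependencies and project-specific abstractions.
Define two context regimes: curated (restricted to the relevant dependencies) and full repository (the entire local repository).
Evaluate frontier LLMs with a generate-check-repair loop, and run provers such as Aristotle end to end.
Analyze how proof success correlates with repository dependency structure (direct vs. transitive dependencies, depth).
Release the benchmark and results on GitHub for community use.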
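
The generate-check-repair loop mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `generate` and `check` callables, the `max_repairs` parameter, and the toy stubs are all hypothetical stand-ins for an LLM call and a Lean compilation step.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generate(goal, context, feedback) -> candidate proof text
#   check(goal, proof)                -> (accepted, error_message)
Generator = Callable[[str, str, Optional[str]], str]
Checker = Callable[[str, str], Tuple[bool, str]]


def generate_check_repair(goal: str, context: str, generate: Generator,
                          check: Checker, max_repairs: int = 3) -> Optional[str]:
    """Return the first proof accepted by `check`, or None if the budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_repairs + 1):  # one initial attempt plus the repair rounds
        proof = generate(goal, context, feedback)
        accepted, error = check(goal, proof)
        if accepted:
            return proof
        feedback = error  # pass the checker's error message to the next attempt
    return None


# Toy stubs standing in for a model and a proof checker.
def toy_generate(goal: str, context: str, feedback: Optional[str]) -> str:
    # First attempt is a placeholder; the "repair" produces a real tactic.
    return "sorry" if feedback is None else "by simp"


def toy_check(goal: str, proof: str) -> Tuple[bool, str]:
    return (proof == "by simp", "error: proof contains sorry")


result = generate_check_repair("theorem t : 1 + 1 = 2", "", toy_generate, toy_check)
```

In this toy run the first attempt is rejected and the second, which sees the error message, is accepted. A real harness would replace `toy_check` with a call that compiles the candidate inside the target repository.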

Figure 1 : Contextual dependencies comparison between (a) mathematical benchmark proofs, (b) lightweight verification tasks, and (c) repository-scale verification. PutnamBench proofs rely almost entirely on library (Mathlib) dependencies (purple). Verina introduces a small number of project-specific

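Experimental Results

Research Questions

RQ1: How well do Mathlib-centric provers transfer to repository-scale verification tasks?
RQ2: How does transitive repository dependence relate to proof success?
RQ3: Does curated local context improve prover performance over full-repository context?
RQ4: Which common patterns and dependency structures affect proof automation in large codebases?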

Key Findings

Category	Model	Curated Context (Pass@8)	Full Context (Pass@8)
VeriSoftBench-Full	Claude Opus 4.5 (Pass@8, r=3)	31.2%	23.2%
VeriSoftBench-Full	GPT-5.2 (Pass@8, r=3)	12.6%	10.8%
VeriSoftBench-Full	Gemini-3-Pro (Pass@8, r=3)	41.0%	34.8%
VeriSoftBench-Full	Gödel-Prover-v2 (Pass@8, r=3)	5.6%	0.0%
VeriSoftBench-Aristotle	Aristotle	-	69%
VeriSoftBench-Aristotle	Gemini-3-Pro (r=3)	-	65%
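
The table reports Pass@8, but this summary does not say how the metric is estimated. As a point of reference, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation, 1 - C(n-c, k)/C(n, k) for n samples of which c succeed; whether the paper uses this exact estimator is an assumption. The function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n (c correct) succeeds."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; `results` holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two hypothetical tasks with 8 samples each: one never solved, one solved 3 times.
score = benchmark_pass_at_k([(8, 0), (8, 3)], k=8)
```

When n equals k, as in the example, the estimator reduces to checking whether any of the samples for a task succeeded.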

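Frontier LLMs and specialized provers achieve only moderate performance on VeriSoftBench tasks.
Performance degrades when completing a goal requires transitive, multi-hop, repository-local dependencies.
Curated context improves performance over full-repository context, but substantial room for improvement remains.
Mathlib-centric benchmarks are poor predictors of performance on repository-scale verification tasks.
Full context may still help inference by offering cues about frequently recurring structural patterns, beyond the dependencies that are directly required.
Aristotle reaches 69% on the subset that includes same-file lemmas; Gemini-3-Pro reaches 65% on the same subset, suggesting that this subset is easier.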
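
The distinction between direct and transitive (multi-hop) dependencies in these findings can be made concrete with a small graph traversal. The sketch below is illustrative only: the dependency graph, the declaration names, and the `dependency_depths` helper are hypothetical and do not come from the benchmark's tooling.

```python
from collections import deque
from typing import Dict, List


def dependency_depths(graph: Dict[str, List[str]], target: str) -> Dict[str, int]:
    """Map each declaration reachable from `target` to its minimum hop distance."""
    depths = {target: 0}
    queue = deque([target])
    while queue:  # breadth-first search yields shortest hop counts
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in depths:
                depths[dep] = depths[node] + 1
                queue.append(dep)
    del depths[target]  # the target is not its own dependency
    return depths


# Hypothetical graph: each declaration maps to the declarations it uses directly.
toy_graph = {
    "target_thm": ["lemma_a", "lemma_b"],
    "lemma_a": ["def_store"],
    "lemma_b": ["def_step"],
    "def_step": ["def_store"],
}

depths = dependency_depths(toy_graph, "target_thm")
direct = sorted(name for name, d in depths.items() if d == 1)
transitive = sorted(name for name, d in depths.items() if d > 1)
closure_size = len(depths)
max_depth = max(depths.values())
```

Under this toy graph the dependency closure has four declarations, two of them reachable only through another lemma. Quantities such as `closure_size` and `max_depth` are the kind of per-task features one could correlate with proof success.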

Figure 3 : An example task instance from our benchmark. The goal is to synthesize a proof for the target theorem cexec_to_reds , which relates two definitions of program execution in a formalized programming language. The figure illustrates the context that must be provided to or retrieved by the pr


This review was generated by AI and checked by a human editor.