QUICK REVIEW

[Paper Review] AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Haoyu Zhao, Ziran Yang|arXiv (Cornell University)|Feb 10, 2026

Software Testing and Debugging Techniques|Cited by 0

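One-Sentence Summary

AlgoVeri benchmarks verified code generation in Dafny, Verus, and Lean on 77 classical algorithms, revealing cross-language gaps and the dynamics of iterative repair.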

ABSTRACT

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively utilizes iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.

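Research Motivation and Goals

Enable fair, cross-language evaluation of vericoding on classical algorithms that involve global invariants.
Achieve semantic alignment between SMT-based and interactive theorem proving (ITP) verification systems.
Quantify model performance and identify toolchain bottlenecks in Dafny, Verus, and Lean.
Analyze the test-time compute dynamics and error patterns of frontier and open-weight models.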

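Proposed Method

Build a corpus of 77 textbook-style algorithm problems with specifications aligned across Dafny, Verus, and Lean.
Evaluate the target LLMs through multi-round refinement, feeding compiler/verifier output back to the model until verification succeeds.
Use a semantic validator to filter solutions beyond what compiler verification checks, ensuring algorithmic fidelity.
Compare frontier and open-weight models across languages to identify performance gaps by algorithm category.
Perform an iso-compute analysis that compares deep repair against repair based on parallel sampling.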
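The refinement loop and the iso-compute comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation harness: the `Generator` and `Verifier` interfaces, the function names, and the toy stubs are all hypothetical, and a "call" here simply means one model generation.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (not taken from the paper or its repository):
#   Generator: (spec, previous_attempt, feedback) -> new candidate program
#   Verifier:  candidate -> (accepted, feedback message)
Generator = Callable[[str, Optional[str], Optional[str]], str]
Verifier = Callable[[str], Tuple[bool, str]]


def repair_chain(spec: str, generate: Generator, verify: Verifier,
                 max_rounds: int) -> Tuple[Optional[str], int]:
    """Run one chain of up to max_rounds attempts, passing verifier
    feedback into the next attempt. Returns (solution or None, calls used)."""
    attempt: Optional[str] = None
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        attempt = generate(spec, attempt, feedback)
        accepted, feedback = verify(attempt)
        if accepted:
            return attempt, round_idx
    return None, max_rounds


def iso_compute_solve(spec: str, generate: Generator, verify: Verifier,
                      budget: int, depth: int) -> Tuple[Optional[str], int]:
    """Spend at most `budget` generator calls as independent chains of
    length `depth`. depth=1 is pure parallel sampling; depth=budget is a
    single deep repair chain."""
    used = 0
    while used < budget:
        rounds = min(depth, budget - used)
        solution, calls = repair_chain(spec, generate, verify, rounds)
        used += calls
        if solution is not None:
            return solution, used
    return None, used


# Toy stand-ins for a model and a verifier, for illustration only.
def toy_generate(spec, previous, feedback):
    return "v0" if previous is None else f"v{int(previous[1:]) + 1}"


def toy_verify(candidate):
    ok = candidate == "v2"
    return ok, "verified" if ok else "invariant might not be maintained"


if __name__ == "__main__":
    print(iso_compute_solve("sort spec", toy_generate, toy_verify, budget=6, depth=3))
    print(iso_compute_solve("sort spec", toy_generate, toy_verify, budget=6, depth=1))
```

Because the toy stubs are deterministic, the demo only exercises the budget bookkeeping; it says nothing about which strategy works better for a real model. With a stochastic model, lowering `depth` at a fixed `budget` trades repair depth for more independent samples, which is the comparison the iso-compute analysis makes.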

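Experimental Results

Research Questions

RQ1: When asked to handle algorithms with global invariants, can LLMs generate code and proofs that SMT-based and ITP verification systems accept?
RQ2: Under aligned specifications, how do performance and failure modes differ across Dafny, Verus, and Lean?
RQ3: What are the relative contributions of model capability and the verification system to vericoding success?
RQ4: Does iterative repair bring significant improvement for open models, and how does that compare with frontier models?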

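Key Findings

Frontier models reach their highest verification pass rates in Dafny (up to 40.3% after semantic filtering), falling to 24.7% in Verus and 7.8% in Lean.
AlgoVeri reveals large performance gaps across languages and algorithm categories; graph algorithms and global invariants are especially challenging in Verus and Lean.
Frontier models keep improving through iterative repair in Dafny and Verus, in some cases doubling pass rates (the abstract cites a tripling in Dafny for Gemini-3), whereas open models such as GPT-OSS-120B saturate earlier.
The iso-compute analysis shows that, for open models, additional repair depth yields smaller gains than parallel sampling; deep repair has limited effect with current architectures.
Language design shapes the refinement trajectory: Dafny supports logic-focused improvement, while Verus and Lean impose syntactic/semantic constraints and search complexity that impede progress.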
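To make the headline pass rates concrete, the short sketch below converts each reported percentage into the nearest whole number of problems out of 77. It assumes each rate is a single count over the full benchmark (for example, not an average over several runs); the source does not state this, so the counts are an inference rather than reported figures.

```python
TOTAL_PROBLEMS = 77  # benchmark size stated in the abstract

# Pass rates in percent, as reported in the abstract.
reported = {"Dafny": 40.3, "Verus": 24.7, "Lean": 7.8}


def implied_count(percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of problems matching a reported percentage."""
    return round(percent / 100 * total)


for lang, pct in reported.items():
    solved = implied_count(pct)
    # Recompute the percentage from the count to check it rounds back.
    print(f"{lang}: ~{solved}/{TOTAL_PROBLEMS} = {100 * solved / TOTAL_PROBLEMS:.1f}%")
```

Under that assumption the rates correspond to roughly 31, 19, and 6 verified problems, and each count rounds back to the reported percentage.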


This review was generated by AI and checked by a human editor.