QUICK REVIEW

[Paper Review] Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

Chaoyuan Peng, Lei Wu|arXiv (Cornell University)|Mar 11, 2026

Ethics and Social Impacts of AI | Cited by: 0

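One-Sentence Summary

The paper extends EVMbench to more model families and scaffolds and adds a contamination-free Incidents dataset to test whether AI agents can detect and exploit smart contract vulnerabilities; it finds unstable model rankings, no successful end-to-end exploits on real-world incidents, and a material scaffold effect, all of which challenge the idea that fully automated AI auditing is within reach.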

ABSTRACT

EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.

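Motivation and Goals

Push for a rigorous, contamination-free evaluation of AI agents on smart contract security that goes beyond the original EVMbench setup.
Extend the evaluation to more model families and scaffolds so that model effects can be separated from tooling effects.
Introduce a contamination-free Incidents dataset of real-world security incidents that postdate every model's release, to test whether benchmark performance transfers to the real world.
Assess whether AI agents can achieve end-to-end exploitation and how the choice of scaffold affects the results.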

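Proposed Method

Replicate the EVMbench infrastructure using 120 vulnerabilities from Code4rena repositories.
Expand to 26 agent configurations across four model families and three scaffolds, to separate model effects from scaffold effects.
Build a contamination-free Incidents dataset of 22 real-world security incidents that postdate every model's release, to test real-world transfer.
Score the Detect task with a model-based judge and the Exploit task with on-chain verification; the Patch task is not included.
Assess the reliability of the scoring by cross-testing three judge models (GPT-5 variants).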
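The contamination-free criterion in the list above (incidents must postdate every model's release) can be sketched as a simple date filter. This is only an illustration under stated assumptions: the model names, incident names, and all dates below are hypothetical placeholders, not values from the paper, and the function `contamination_free` is an invented helper rather than code from the authors' repository.

```python
from datetime import date

# Hypothetical release dates; the real ones are not given in this summary.
MODEL_RELEASES = {
    "model_a": date(2025, 6, 1),
    "model_b": date(2025, 9, 15),
}

# Hypothetical incidents as (name, date of the on-chain incident).
INCIDENTS = [
    ("incident_1", date(2025, 5, 20)),   # before both releases
    ("incident_2", date(2025, 9, 15)),   # same day as the latest release
    ("incident_3", date(2025, 11, 3)),   # after every release
]


def contamination_free(incidents, releases):
    """Keep incidents dated strictly after the latest model release."""
    cutoff = max(releases.values())
    return [name for name, when in incidents if when > cutoff]


clean = contamination_free(INCIDENTS, MODEL_RELEASES)
print(clean)
```

Using the latest release date as a single cutoff is the strictest reading of "postdating every model's release date": an incident is kept only if no evaluated model could have been released after it.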

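Experimental Results

Research Questions

RQ1: Do AI agents' vulnerability-detection rankings remain stable when the set of model families and scaffolds is expanded?
RQ2: Do the performance patterns observed on the curated EVMbench data transfer to contamination-free, real-world incidents?
RQ3: How does an agent's scaffold affect detection and exploitation results?
RQ4: Can AI agents achieve end-to-end exploitation of real-world incidents, and what does this imply about whether discovery or exploitation is the bottleneck?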

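Key Findings

Agents' detection rankings are not stable: model rankings shift noticeably across configurations, tasks, and datasets.
On the contamination-free Incidents dataset, no agent achieves end-to-end exploitation in any of the 110 agent-incident pairs, even though some agents detect up to 65% of the vulnerabilities.
The scaffold materially affects results: in controlled comparisons, an open-source scaffold outperforms vendor alternatives by up to 5 percentage points.
Performance patterns on the curated EVMbench data do not transfer directly to real-world incidents: agents detect up to 65% of incident vulnerabilities yet exploit none, which contradicts EVMbench's conclusion that discovery is the primary bottleneck.
Exploit performance does not track detection performance, which suggests that the two tasks draw on different capabilities and supports a human-in-the-loop workflow.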
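One way to quantify the ranking instability described in the findings above is a rank correlation between two evaluation settings. The sketch below computes Kendall's tau-a by hand. It is not the paper's own analysis: the configuration names and detection scores are hypothetical, and the code assumes both settings cover the same configurations with no tied scores.

```python
from itertools import combinations


def ranking(scores):
    """Return configuration names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score tables over the same configurations.

    Assumes identical keys in both tables and no tied scores.
    """
    names = list(scores_a)
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        # Positive product: the pair is ordered the same way in both settings.
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical detection rates, for illustration only.
curated = {"cfg_a": 0.40, "cfg_b": 0.35, "cfg_c": 0.30, "cfg_d": 0.20}
incidents = {"cfg_a": 0.30, "cfg_b": 0.50, "cfg_c": 0.45, "cfg_d": 0.10}

print(ranking(curated))
print(ranking(incidents))
tau = kendall_tau(curated, incidents)
print(round(tau, 3))
```

A tau of 1 means the two settings rank the configurations identically, and values well below 1 indicate that the ordering shifts between settings. If tied scores are possible, a tie-aware variant such as `scipy.stats.kendalltau` would be a more appropriate choice than this hand-rolled version.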


This review was generated by AI and reviewed by a human editor.