QUICK REVIEW

[Paper Review] Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

Chaoyuan Peng, Lei Wu|arXiv (Cornell University)|Mar 11, 2026

Ethics and Social Impacts of AI|Citations: 0

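One-Line Summary

Summary: The paper extends EVMbench to more model families and scaffolds and introduces Incidents, a contamination-free dataset, to evaluate how well AI agents detect and exploit smart contract vulnerabilities. It finds unstable model rankings, limited real-world exploitation, and a pronounced scaffold effect, all of which challenge the idea that automated AI auditing is near.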

ABSTRACT

EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.

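Motivation and Objectives

Evaluate AI agents on smart contract security rigorously and without training-data contamination, going beyond the original EVMbench setup.
Expand the set of model families and scaffolds to separate model effects from tooling effects.
Introduce Incidents, a contamination-free dataset of real-world security incidents that occurred after the models' releases, to test whether benchmark performance transfers to the real world.
Assess whether AI agents can achieve end-to-end exploitation, and examine how scaffold choice affects results.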

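Proposed Method

Reproduce the EVMbench infrastructure with 120 vulnerabilities drawn from Code4rena repositories.
Expand to 26 agent configurations across four model families and three scaffolds to separate model effects from scaffold effects.
Build Incidents, a contamination-free dataset of 22 real-world security incidents that postdate every model's release, to test real-world transfer.
Use the Detect and Exploit tasks (Patch is excluded); Detect is scored by model-based grading and Exploit by on-chain verification.
Cross-validate three judge models (GPT-5 family) to assess grader reliability.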
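
The grader cross-validation in the last step can be illustrated with pairwise agreement between judges. The following is a minimal sketch, not the paper's procedure: the judge names and verdicts are hypothetical, and the raw agreement rate is an assumed metric chosen for simplicity.

```python
from itertools import combinations


def pairwise_agreement(verdicts):
    """Return the fraction of items on which each pair of judges agrees.

    `verdicts` maps a judge name to a list of boolean verdicts, one per
    graded item, with all lists in the same item order.
    """
    lengths = {len(v) for v in verdicts.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every judge must grade the same non-empty item list")
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        matches = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
        rates[(a, b)] = matches / len(verdicts[a])
    return rates


# Hypothetical verdicts: True means "the report identifies the vulnerability".
example_verdicts = {
    "judge_a": [True, True, False, False, True],
    "judge_b": [True, False, False, False, True],
    "judge_c": [True, True, False, True, True],
}
agreement = pairwise_agreement(example_verdicts)
```

A low agreement rate for any pair would suggest that Detect scores depend on which judge is used, which is the kind of reliability concern this step is meant to surface.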

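Experimental Results

Research Questions

RQ1: Do AI agents' vulnerability detection rankings remain stable when the set of model families and scaffolds is expanded?
RQ2: Do performance patterns on curated EVMbench data transfer to contamination-free real-world incidents?
RQ3: How does agent scaffolding affect detection and exploitation results?
RQ4: Can AI agents achieve end-to-end exploitation on real-world incidents, and what does this show about whether discovery or exploitation is the bottleneck?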

Key Findings

Rank	Agent Configuration	Scaffold	Score	Score (%)	Tasks w/ > 0
1	Claude Opus 4.6	CC	57	47.5%	30/40
2	Gemini 3.1 Pro +tools	OC	45	37.5%	30/39
3	Claude Opus 4.5	OC	43	35.8%	24/35
4	Gemini 3.1 Pro	OC	42	35.0%	27/39
5	Claude Opus 4.5	CC	37	30.8%	24/38
6	Claude Opus 4.6	CC	35	29.2%	24/40
6	Claude Sonnet 4.5	OC	35	29.2%	24/40
6	GPT-5.3-Codex (low)	OC	35	29.2%	25/40
9	GPT-5.2 (high)	Codex	34	28.3%	23/40
9	GPT-5.2 (xhigh)	Codex	34	28.3%	22/40
11	GPT-5.3-Codex (low)	Codex	33	27.5%	23/40
11	GPT-5.3-Codex (high)	OC	33	27.5%	23/40
13	Claude Sonnet 4.5	CC	32	26.7%	21/38
14	GPT-5.2 (medium)	Codex	31	25.8%	22/40
14	GPT-5.3-Codex (xhigh, agentic)	Codex	31	25.8%	21/40
16	GPT-5.3-Codex (xhigh)	Codex	30	25.0%	22/40
17	GPT-5.2 (low)	Codex	29	24.2%	21/40
17	GPT-5.3-Codex (medium)	OC	29	24.2%	21/40
19	GPT-5.3-Codex (high, agentic)	Codex	28	23.3%	21/40
19	GPT-5.3-Codex (xhigh)	OC	28	23.3%	18/40
21	GPT-5.3-Codex (high)	Codex	27	22.5%	19/40
21	GPT-5.3-Codex (medium, agentic)	Codex	27	22.5%	22/40
23	GPT-5.3-Codex (high, agentic)	Codex	26	21.7%	20/40
23	GPT-5.3-Codex (medium)	Codex	26	21.7%	20/40
25	GLM-5	OC	25	20.8%	19/40
26	Gemini 3 Pro Preview	OC	20	16.7%	16/40
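
The Score (%) and Rank columns above can be reproduced from the raw scores. The sketch below assumes the percentage is Score divided by the 120 vulnerabilities (consistent with 57 mapping to 47.5%) and that ties share the best rank, as in standard competition ranking; the function names are illustrative, not from the paper's code.

```python
TOTAL_VULNERABILITIES = 120  # assumed denominator for Score (%)


def score_percent(score, total=TOTAL_VULNERABILITIES):
    """Convert a raw score to a percentage rounded to one decimal place."""
    return round(100 * score / total, 1)


def competition_ranks(scores):
    """Rank scores in descending order; tied scores share the best rank."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]


# Raw scores of the first twelve rows of the table, in table order.
top_scores = [57, 45, 43, 42, 37, 35, 35, 35, 34, 34, 33, 33]
percents = [score_percent(s) for s in top_scores]
ranks = competition_ranks(top_scores)
```

Under these assumptions the three configurations scoring 35 all receive rank 6 and the next configuration receives rank 9, matching the table.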

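Agents' vulnerability detection rankings are unstable: model rankings shift substantially across configurations, tasks, and datasets.
On the contamination-free Incidents dataset, no agent achieves end-to-end exploitation in any of the 110 agent-incident pairs, even though some agents detect up to 65% of the vulnerabilities.
Agent scaffolding materially affects results: an open-source scaffold outperforms vendor alternatives by up to 5 percentage points.
Performance on curated EVMbench data does not transfer directly to real-world incidents: agents detect vulnerabilities yet fail to exploit them, which contradicts EVMbench's conclusion that discovery is the primary bottleneck.
The Exploit task depends on different capabilities than detection does, which supports a human-in-the-loop workflow.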
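
The scaffold effect can be checked against the table by comparing rows where the same model appears under two scaffolds. This is a sketch under stated assumptions: the table does not define its abbreviations, so treating OC as the open-source scaffold and CC and Codex as vendor scaffolds is an assumption, as is the 120-vulnerability denominator. Models that appear twice under the same scaffold in the table are left out.

```python
TOTAL = 120  # assumed denominator for percentage points

# (model, scaffold) -> Score, copied from the table above.
scores = {
    ("Claude Opus 4.5", "OC"): 43, ("Claude Opus 4.5", "CC"): 37,
    ("Claude Sonnet 4.5", "OC"): 35, ("Claude Sonnet 4.5", "CC"): 32,
    ("GPT-5.3-Codex (low)", "OC"): 35, ("GPT-5.3-Codex (low)", "Codex"): 33,
    ("GPT-5.3-Codex (medium)", "OC"): 29, ("GPT-5.3-Codex (medium)", "Codex"): 26,
    ("GPT-5.3-Codex (high)", "OC"): 33, ("GPT-5.3-Codex (high)", "Codex"): 27,
    ("GPT-5.3-Codex (xhigh)", "OC"): 28, ("GPT-5.3-Codex (xhigh)", "Codex"): 30,
}


def scaffold_deltas(scores, baseline="OC", total=TOTAL):
    """Return, per model, the baseline scaffold's lead in percentage points."""
    deltas = {}
    for (model, scaffold), score in scores.items():
        if scaffold == baseline:
            continue
        base = scores.get((model, baseline))
        if base is not None:
            deltas[model] = round(100 * (base - score) / total, 1)
    return deltas


deltas = scaffold_deltas(scores)
largest_gap = max(deltas.values())
```

A positive value means the assumed open-source scaffold scored higher for that model, and a negative value means the vendor scaffold did. The largest gap in these rows is 5.0 percentage points, which is consistent with the "up to 5 percentage points" finding, though the paper may compute the figure differently.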

This review was generated by AI and checked by a human editor.