QUICK REVIEW

[Paper Review] Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?

Chaoyuan Peng, Lei Wu|arXiv (Cornell University)|2026. 03. 11.

Ethics and Social Impacts of AI | Citations: 0

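One-Line Summary

The paper extends EVMBench to more model families and scaffolds and introduces a contamination-free Incidents dataset to test how well AI agents detect and exploit smart contract vulnerabilities. It finds unstable model rankings, limited real-world exploitation, and strong scaffold effects, all of which undermine the idea of fully automated AI auditing.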

ABSTRACT

EVMbench, released by OpenAI, Paradigm, and OtterSec, is the first large-scale benchmark for AI agents on smart contract security. Its results -- agents detect up to 45.6% of vulnerabilities and exploit 72.2% of a curated subset -- have fueled expectations that fully automated AI auditing is within reach. We identify two limitations: its narrow evaluation scope (14 agent configurations, most models tested on only their vendor scaffold) and its reliance on audit-contest data published before every model's release that models may have seen during training. To address these, we expand to 26 configurations across four model families and three scaffolds, and introduce a contamination-free dataset of 22 real-world security incidents postdating every model's release date. Our evaluation yields three findings: (1) agents' detection results are not stable, with rankings shifting across configurations, tasks, and datasets; (2) on real-world incidents, no agent succeeds at end-to-end exploitation across all 110 agent-incident pairs despite detecting up to 65% of vulnerabilities, contradicting EVMbench's conclusion that discovery is the primary bottleneck; and (3) scaffolding materially affects results, with an open-source scaffold outperforming vendor alternatives by up to 5 percentage points, yet EVMbench does not control for this. These findings challenge the narrative that fully automated AI auditing is imminent. Agents reliably catch well-known patterns and respond strongly to human-provided context, but cannot replace human judgment. For developers, agent scans serve as a pre-deployment check. For audit firms, agents are most effective within a human-in-the-loop workflow where AI handles breadth and human auditors contribute protocol-specific knowledge and adversarial reasoning. Code and data: https://github.com/blocksecteam/ReEVMBench/.

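Research Motivation and Objectives

Motivate a rigorous, contamination-free evaluation of AI agents for smart contract security that goes beyond the original EVMBench setup.
Separate model effects from tooling effects by expanding across model families and scaffolds.
Introduce a contamination-free Incidents dataset of real-world security incidents that postdate every model's release, to test whether benchmark performance transfers to the real world.
Evaluate whether AI agents can achieve end-to-end exploitation and how the choice of scaffold affects results.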

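Proposed Method

Reproduce the EVMBench infrastructure, which includes 120 vulnerabilities from Code4rena repositories.
Expand to 26 agent configurations across four model families and three scaffolds to distinguish model effects from scaffold effects.
Build a contamination-free Incidents dataset of 22 real-world, post-release security incidents to test real-world transfer.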
Evaluate the Detect and Exploit tasks (Patch is excluded): Detect with model-based grading and Exploit with on-chain verification.
Cross-test three judge models (GPT-5 variants) to assess the reliability of the grader.
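
One simple way to cross-check several judge models is to compare their verdicts on the same findings and report pairwise agreement. The sketch below is only an illustration of that idea: the judge labels and the verdict lists are invented placeholder data, not the paper's judges or results, and the paper may measure reliability differently.

```python
from itertools import combinations


def pairwise_agreement(verdicts):
    """Map each pair of judges to the fraction of items they agree on."""
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        if len(verdicts[a]) != len(verdicts[b]):
            raise ValueError("judges must grade the same items")
        matches = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
        rates[(a, b)] = matches / len(verdicts[a])
    return rates


# Hypothetical placeholder verdicts: True means the judge accepts the
# agent's finding as matching the ground-truth vulnerability.
verdicts = {
    "judge_a": [True, True, False, True, False, False],
    "judge_b": [True, True, False, False, False, False],
    "judge_c": [True, False, False, True, False, False],
}

agreement = pairwise_agreement(verdicts)
for pair, rate in agreement.items():
    print(pair, f"{rate:.2f}")
```

Low agreement between any pair would suggest that detection scores depend on which judge is used, which is the concern a cross-test of this kind is meant to surface.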

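Experimental Results

Research Questions

RQ1: Do AI agents' vulnerability-detection rankings remain stable when the set of model families and scaffolds is expanded?
RQ2: Do performance patterns on curated EVMbench data transfer to contamination-free real-world incidents?
RQ3: How does agent scaffolding affect detection and exploitation results?
RQ4: Can AI agents achieve end-to-end exploitation on real-world incidents, and what does this imply about the discovery-versus-exploitation bottleneck?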

Key Results

Rank	Agent Configuration	Scaffold	Score	Score (%)	Tasks w/ > 0
1	Claude Opus 4.6	CC	57	47.5%	30/40
2	Gemini 3.1 Pro +tools	OC	45	37.5%	30/39
3	Claude Opus 4.5	OC	43	35.8%	24/35
4	Gemini 3.1 Pro	OC	42	35.0%	27/39
5	Claude Opus 4.5	CC	37	30.8%	24/38
6	Claude Opus 4.6	CC	35	29.2%	24/40
6	Claude Sonnet 4.5	OC	35	29.2%	24/40
6	GPT-5.3-Codex (low)	OC	35	29.2%	25/40
9	GPT-5.2 (high)	Codex	34	28.3%	23/40
9	GPT-5.2 (xhigh)	Codex	34	28.3%	22/40
11	GPT-5.3-Codex (low)	Codex	33	27.5%	23/40
11	GPT-5.3-Codex (high)	OC	33	27.5%	23/40
13	Claude Sonnet 4.5	CC	32	26.7%	21/38
14	GPT-5.2 (medium)	Codex	31	25.8%	22/40
14	GPT-5.3-Codex (xhigh, agentic)	Codex	31	25.8%	21/40
16	GPT-5.3-Codex (xhigh)	Codex	30	25.0%	22/40
17	GPT-5.2 (low)	Codex	29	24.2%	21/40
17	GPT-5.3-Codex (medium)	OC	29	24.2%	21/40
19	GPT-5.3-Codex (high, agentic)	Codex	28	23.3%	21/40
19	GPT-5.3-Codex (xhigh)	OC	28	23.3%	18/40
21	GPT-5.3-Codex (high)	Codex	27	22.5%	19/40
21	GPT-5.3-Codex (medium, agentic)	Codex	27	22.5%	22/40
23	GPT-5.3-Codex (high, agentic)	Codex	26	21.7%	20/40
23	GPT-5.3-Codex (medium)	Codex	26	21.7%	20/40
25	GLM-5	OC	25	20.8%	19/40
26	Gemini 3 Pro Preview	OC	20	16.7%	16/40
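
The Score (%) column is consistent with dividing Score by the benchmark's 120 vulnerabilities, and the Rank column follows standard competition ranking, where tied scores share a rank and the following rank is skipped. A minimal sketch that reproduces both for the first ten rows of the table; the function names are illustrative, not taken from the paper's code.

```python
TOTAL_VULNERABILITIES = 120  # benchmark size stated in the method description


def score_pct(score, total=TOTAL_VULNERABILITIES):
    """Score as a percentage of all vulnerabilities, rounded to one decimal."""
    return round(100 * score / total, 1)


def competition_ranks(scores):
    """Ranks for scores sorted in descending order; ties share a rank."""
    ranks = []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # otherwise rank is the 1-based position
    return ranks


# Score column for the first ten rows of the table.
scores = [57, 45, 43, 42, 37, 35, 35, 35, 34, 34]

percentages = [score_pct(s) for s in scores]
ranks = competition_ranks(scores)
print(list(zip(ranks, scores, percentages)))
```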

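Agents' vulnerability-detection rankings are unstable: model rankings shift substantially across configurations, tasks, and datasets.
On the contamination-free Incidents dataset, no agent achieved end-to-end exploitation in any of the 110 agent-incident pairs, even though agents detected up to 65% of the vulnerabilities.
Agent scaffolding materially affects results: in a controlled comparison, an open-source scaffold outperformed vendor alternatives by up to 5 percentage points.
Detection performance on curated EVMbench data does not transfer directly to real-world incidents, and the results challenge the conclusion that discovery is the primary bottleneck.
Performance on the Exploit task diverges from performance on Detect, which suggests that the two tasks draw on different capabilities and supports human-in-the-loop workflows.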
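
The scaffold effect can be read off the results table by pairing rows that share a model and differ only in scaffold. The sketch below does this for six such pairs, using the Score column as given. Treating OC as the open-source scaffold and CC and Codex as the vendor scaffolds is an assumption drawn from the abstract's wording; this review does not spell out the abbreviations.

```python
TOTAL_VULNERABILITIES = 120  # benchmark size stated in the method description

# (model, score on OC, score on the vendor scaffold), copied from the table.
# Assumption: OC is the open-source scaffold; CC and Codex are vendor scaffolds.
paired_scores = [
    ("Claude Opus 4.5", 43, 37),
    ("Claude Sonnet 4.5", 35, 32),
    ("GPT-5.3-Codex (low)", 35, 33),
    ("GPT-5.3-Codex (medium)", 29, 26),
    ("GPT-5.3-Codex (high)", 33, 27),
    ("GPT-5.3-Codex (xhigh)", 28, 30),
]


def delta_pp(open_score, vendor_score, total=TOTAL_VULNERABILITIES):
    """Open-source minus vendor score, in percentage points."""
    return round(100 * (open_score - vendor_score) / total, 1)


deltas = {model: delta_pp(oc, vendor) for model, oc, vendor in paired_scores}
largest_gap = max(deltas.values())

for model, gap in deltas.items():
    print(f"{model}: {gap:+.1f} pp")
```

On these rows the largest gap is 5.0 percentage points, in line with the "up to 5 percentage points" figure in the abstract. One pair, GPT-5.3-Codex (xhigh), goes the other way, so the advantage is not uniform.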

This review was generated by AI and reviewed by a human editor.