QUICK REVIEW

[Paper Review] LLM Agents can Autonomously Exploit One-day Vulnerabilities

Richard Fang, Rohan Bindu | arXiv (Cornell University) | Apr 11, 2024

Web Application Security Vulnerabilities | 9 citations

TL;DR

The paper shows that a GPT-4-based LLM agent, given CVE descriptions and tools, can autonomously exploit real-world one-day vulnerabilities with an 87% success rate, while other models and scanners fail.

ABSTRACT

LLMs have become increasingly powerful, both in their benign and malicious uses. With the increase in capabilities, researchers have been increasingly interested in their ability to exploit cybersecurity vulnerabilities. In particular, recent work has conducted preliminary studies on the ability of LLM agents to autonomously hack websites. However, these studies are limited to simple vulnerabilities. In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.

Motivation & Objective

Demonstrate whether LLM agents can autonomously exploit real-world one-day vulnerabilities.
Create a benchmark of open-source, real-world CVEs suitable for reproducible testing.
Evaluate GPT-4 against other LLMs and vulnerability scanners on this benchmark.
Analyze the impact of providing versus withholding CVE descriptions on exploit success.
Estimate the cost and feasibility of LLM-based automated exploitation.

Proposed method

Construct a benchmark of 15 real-world one-day vulnerabilities from CVEs and academic sources.
Build a minimal ReAct-based agent (91 lines of code) with access to tools (browser, terminal, web search results, file editor, code interpreter).
Use GPT-4 as the base model and compare with GPT-3.5 and 8 open-source models across the same setup.
Compare against open-source vulnerability scanners (ZAP, Metasploit), which do not autonomously exploit vulnerabilities.
Run end-to-end experiments measuring pass at 5, overall success rate, and token-based cost.
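The ReAct pattern named above alternates model "thoughts" and tool "actions" with observations fed back into the context. The following is only a minimal sketch of that loop shape, not the authors' 91-line agent: the model is replaced by a fixed script, the tools are harmless stubs, and every name here (`scripted_model`, `fake_terminal`, `fake_browser`, `react_loop`, the `Action: tool[arg]` syntax) is a hypothetical chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical, harmless stand-ins for two of the tool categories in the review.
def fake_terminal(arg: str) -> str:
    return f"(terminal stub) ran: {arg}"

def fake_browser(arg: str) -> str:
    return f"(browser stub) fetched: {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "terminal": fake_terminal,
    "browser": fake_browser,
}

def scripted_model(history: List[str]) -> str:
    """Stand-in for an LLM call: returns the next Thought/Action text.

    A real agent would send `history` to a model; here the reply is
    picked from a fixed script based on how many observations exist.
    """
    steps = sum(1 for h in history if h.startswith("Observation:"))
    script = [
        "Thought: look at the page first\nAction: browser[http://localhost:8000]",
        "Thought: check the service version\nAction: terminal[echo version-check]",
        "Thought: I have enough information\nFinal: done",
    ]
    return script[min(steps, len(script) - 1)]

def parse_action(text: str) -> Tuple[str, str]:
    """Extract (tool_name, argument) from a line like 'Action: tool[arg]'."""
    line = [l for l in text.splitlines() if l.startswith("Action:")][0]
    name, _, rest = line[len("Action:"):].strip().partition("[")
    return name, rest.rstrip("]")

def react_loop(task: str,
               model: Callable[[List[str]], str] = scripted_model,
               max_steps: int = 5) -> List[str]:
    """Run the thought -> action -> observation cycle until 'Final:' or max_steps."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = model(history)
        history.append(reply)
        if "Final:" in reply:
            break
        tool, arg = parse_action(reply)
        history.append(f"Observation: {TOOLS[tool](arg)}")
    return history

if __name__ == "__main__":
    for entry in react_loop("demo task"):
        print(entry)
```

The step cap (`max_steps`) is a design choice of this sketch; the review does not state how the original agent bounds its runs.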

Experimental results

Research questions

RQ1: Can an LLM agent autonomously exploit real-world one-day vulnerabilities given CVE descriptions?
RQ2: How does GPT-4 performance compare to other LLMs and to vulnerability scanners on this benchmark?
RQ3: What is the impact of removing the CVE description on exploit success and vulnerability identification?
RQ4: What are the practical costs of running such exploitation with GPT-4?
RQ5: Do planning enhancements or subagents improve performance for exploitation tasks?

Key findings

GPT-4 achieves an 87% success rate (pass at 5) on the 15-vulnerability benchmark when given CVE descriptions.
GPT-4 outperforms GPT-3.5, 8 open-source models, and open-source scanners, all of which achieve 0% exploitation success on the benchmark.
Without CVE descriptions, GPT-4’s success drops to 7% and vulnerability identification becomes significantly harder.
Open-source scanners (ZAP, Metasploit) cannot autonomously exploit these vulnerabilities, underscoring the gap between scanning and exploitation capabilities.
The agent’s performance suggests an emergent capability in GPT-4, with potential improvements from planning, subagents, and larger tool-usage capacity.
The average number of actions the agent takes per vulnerability (with CVE descriptions) varies across vulnerabilities, reflecting differences in task complexity and navigation requirements.
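The review names two distinct metrics, pass at 5 and overall success rate, plus a token-based cost. The sketch below shows how these could be computed from per-trial outcomes. It assumes the usual reading of pass at 5 (a task counts as solved if at least one of five attempts succeeds) and a per-trial reading of overall success rate; the trial data, task names, and token prices are invented for illustration and are not results from the paper.

```python
from typing import Dict, List

def pass_at_k(trials: Dict[str, List[bool]], k: int = 5) -> float:
    """Fraction of tasks with at least one success in the first k trials."""
    solved = sum(1 for outcomes in trials.values() if any(outcomes[:k]))
    return solved / len(trials)

def overall_success_rate(trials: Dict[str, List[bool]]) -> float:
    """Fraction of all individual trials, pooled across tasks, that succeeded."""
    flat = [ok for outcomes in trials.values() for ok in outcomes]
    return sum(flat) / len(flat)

def run_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Token-based cost of one run; the prices are caller-supplied assumptions."""
    return (input_tokens / 1000 * usd_per_1k_in
            + output_tokens / 1000 * usd_per_1k_out)

if __name__ == "__main__":
    # Made-up outcomes for four hypothetical tasks, five trials each.
    demo = {
        "task-a": [False, True, False, False, True],
        "task-b": [False, False, False, False, False],
        "task-c": [True, True, False, True, True],
        "task-d": [False, False, False, False, True],
    }
    print(pass_at_k(demo, k=5))        # 3 of 4 tasks solved at least once
    print(overall_success_rate(demo))  # 7 of 20 trials succeeded
    print(run_cost_usd(10_000, 2_000, usd_per_1k_in=0.5, usd_per_1k_out=1.0))
```

The two rates can differ widely: a task solved once in five attempts counts fully toward pass at 5 but contributes only one success in five to the pooled rate.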


This review was created by AI and reviewed by human editors.