[Paper Review] Can Large Language Models Find And Fix Vulnerable Software?
GPT-4 identifies far more vulnerabilities than traditional analyzers and can propose fixes, achieving a ~4x increase in findings and up to a 90% reduction in vulnerabilities after corrections across multiple languages.
In this study, we evaluated the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times the vulnerabilities than its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, revealing the highest vulnerabilities in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. A critical insight was LLMs' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on LLMs' potential.
Motivation & Objective
- Assess GPT-4's ability to detect software vulnerabilities compared with traditional static analyzers (e.g., Snyk and Fortify).
- Evaluate GPT-4's capability to propose and apply fixes to identified vulnerabilities.
- Quantify vulnerability detection/repair across diverse repositories and programming languages.
- Examine whether LLMs can self-audit by proposing fixes for their own identified issues.
Proposed method
- Automated querying of GPT-4 via API with system context set to act as a static code analyzer for multiple languages.
- Evaluation across eight languages: C, Ruby, PHP, Java, JavaScript, C#, Go, Python.
- Comparison against Snyk on six public GitHub repositories and a larger 129-file, 2372-LOC vulnerability dataset.
- Final evaluation where GPT-4 is prompted to rewrite vulnerable code with fixes, then re-scanned by Snyk to measure improvement.
Experimental results
Research questions
- RQ1How does GPT-4's vulnerability detection performance compare to Snyk and Fortify on diverse codebases?
- RQ2Can GPT-4 generate actionable, correct fixes for identified vulnerabilities across multiple languages?
- RQ3Does prompting GPT-4 to fix code reduce overall vulnerabilities when re-scanned with a static analyzer?
- RQ4Are there language- or vulnerability-class patterns where GPT-4 performs especially well or poorly?
Key findings
| Codebase & Reference | SLOC | Critical | High | Med | Low |
|---|---|---|---|---|---|
| Original GitHub Repo [2] | 2372 | 0 | 66 | 20 | 12 |
| GPT-4 Corrected GitHub Repo [38] | 2636 | 0 | 4 | 5 | 1 |
| Difference | +264 | 0 | -62 | -15 | -11 |
- GPT-4 identified about four times as many vulnerabilities as traditional analyzers (e.g., Snyk).
- GPT-4 provided viable fixes for each identified vulnerability, yielding a low false-positive rate.
- Corrected code via GPT-4 reduced vulnerabilities by about 90% with an 11% increase in added code lines.
- Across 129 files, PHP and JavaScript contained nearly half of all findings.
- Post-correction Snyk reported a reduction of high-severity vulnerabilities by 94%, medium by 75%, and low by 92% (from 98 to 10).
- GPT-4 corrections produced 398 code fixes across the vulnerability set.
Better researchstarts right now
From paper design to paper writing, dramatically reduce your research time.
No credit card · Free plan available
This review was created by AI and reviewed by human editors.