QUICK REVIEW

[Paper Review] Can Large Language Models Find And Fix Vulnerable Software?

David Noever|arXiv (Cornell University)|Aug 20, 2023

Software Reliability and Analysis Research12 citations

TL;DR

GPT-4 identifies far more vulnerabilities than traditional analyzers and can propose fixes, achieving a ~4x increase in findings and up to a 90% reduction in vulnerabilities after corrections across multiple languages.

ABSTRACT

In this study, we evaluated the capability of Large Language Models (LLMs), particularly OpenAI's GPT-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like Snyk and Fortify. Our analysis covered numerous repositories, including those from NASA and the Department of Defense. GPT-4 identified approximately four times the vulnerabilities than its counterparts. Furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. Our tests encompassed 129 code samples across eight programming languages, revealing the highest vulnerabilities in PHP and JavaScript. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. A critical insight was LLMs' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. Future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on LLMs' potential.

Motivation & Objective

Assess GPT-4's ability to detect software vulnerabilities compared with traditional static analyzers (e.g., Snyk and Fortify).
Evaluate GPT-4's capability to propose and apply fixes to identified vulnerabilities.
Quantify vulnerability detection/repair across diverse repositories and programming languages.
Examine whether LLMs can self-audit by proposing fixes for their own identified issues.

Proposed method

Automated querying of GPT-4 via API with system context set to act as a static code analyzer for multiple languages.
Evaluation across eight languages: C, Ruby, PHP, Java, JavaScript, C#, Go, Python.
Comparison against Snyk on six public GitHub repositories and a larger 129-file, 2372-LOC vulnerability dataset.
Final evaluation where GPT-4 is prompted to rewrite vulnerable code with fixes, then re-scanned by Snyk to measure improvement.

Experimental results

Research questions

RQ1How does GPT-4's vulnerability detection performance compare to Snyk and Fortify on diverse codebases?
RQ2Can GPT-4 generate actionable, correct fixes for identified vulnerabilities across multiple languages?
RQ3Does prompting GPT-4 to fix code reduce overall vulnerabilities when re-scanned with a static analyzer?
RQ4Are there language- or vulnerability-class patterns where GPT-4 performs especially well or poorly?

Key findings

Codebase & Reference	SLOC	High	Med	Low
Original GitHub Repo [2]	2372	66	20	12
GPT-4 Corrected GitHub Repo [38]	2636	4	5	1
Difference	+264	-62	-15	-11

GPT-4 identified about four times as many vulnerabilities as traditional analyzers (e.g., Snyk).
GPT-4 provided viable fixes for each identified vulnerability, yielding a low false-positive rate.
Corrected code via GPT-4 reduced vulnerabilities by about 90% with an 11% increase in added code lines.
Across 129 files, PHP and JavaScript contained nearly half of all findings.
Post-correction Snyk reported a reduction of high-severity vulnerabilities by 94%, medium by 75%, and low by 92% (from 98 to 10).
GPT-4 corrections produced 398 code fixes across the vulnerability set.

Better researchstarts right now

From paper design to paper writing, dramatically reduce your research time.

No credit card · Free plan available

This review was created by AI and reviewed by human editors.