QUICK REVIEW

[Paper Review] IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

Ziyang Li, Saikat Dutta | arXiv (Cornell University) | May 27, 2024
Software Reliability and Analysis Research | 14 citations
TL;DR

IRIS combines LLMs with static taint analysis to perform whole-repository vulnerability detection in Java, by inferring CWE-specific taint specifications with LLMs and augmenting CodeQL. GPT-4 achieves the best results, detecting 69 vulnerabilities (vs CodeQL’s 27) and reducing false positives by up to ~80%.

ABSTRACT

Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice due to their reliance on human labeled specifications. Large language models (or LLMs) have shown impressive code generation capabilities but they cannot do complex reasoning over code to detect such vulnerabilities especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, alleviating needs for human specifications and inspection. For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnerabilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5% points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools. IRIS is available publicly at https://github.com/iris-sast/iris.

Motivation & Objective

  • Motivate the need for scalable, whole-repository vulnerability detection beyond method-level analysis.
  • Propose a neuro-symbolic pipeline that fuses LLM-driven taint specification inference with static taint analysis (CodeQL).
  • Curate CWE-Bench-Java, a real-world Java vulnerability dataset, to evaluate whole-project reasoning capabilities.
  • Demonstrate that on CWE-Bench-Java, IRIS improves vulnerability detection over CodeQL and reduces false positives via contextual LLM-based filtering.

Proposed method

  • Build a Java project data-flow graph and extract candidate APIs using static analysis (CodeQL).
  • Infer CWE-specific taint sources and sinks for external/internal APIs by prompting LLMs and returning JSON-formatted specs.
  • Translate LLM-inferred specs into CodeQL taint-analysis queries to detect unsanitized data flows.
  • Run CodeQL with CWE-specific queries to obtain candidate vulnerable paths, then use LLM-based contextual analysis to filter false positives.
  • Evaluate across multiple LLMs (GPT-4, GPT-3.5, Llama variants, DeepSeekCoder, Mistral, Gemma) on CWE-Bench-Java.
  • Present results and analyze precision of inferred specs and the effectiveness of contextual filtering.
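
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IRIS implementation: the JSON schema, the API names, the toy data-flow graph, and the stubbed contextual check are all hypothetical, and a plain breadth-first search stands in for the CodeQL taint queries that the real pipeline generates.

```python
import json
from collections import deque

# Hypothetical LLM reply labelling candidate APIs for one CWE.
# The schema and API names are illustrative, not IRIS's actual format.
LLM_REPLY = """
[
  {"api": "HttpRequest.getParameter", "label": "source"},
  {"api": "File.<init>", "label": "sink"},
  {"api": "String.trim", "label": "none"}
]
"""

def parse_specs(reply):
    """Split the JSON labels into a set of sources and a set of sinks."""
    labels = json.loads(reply)
    sources = {e["api"] for e in labels if e["label"] == "source"}
    sinks = {e["api"] for e in labels if e["label"] == "sink"}
    return sources, sinks

# Toy data-flow graph: an edge u -> v means data may flow from u to v.
FLOW = {
    "HttpRequest.getParameter": ["String.trim", "PathUtil.normalize"],
    "String.trim": ["File.<init>"],
    "PathUtil.normalize": ["File.<init>"],
    "Config.getDefault": ["File.<init>"],
}

def taint_paths(flow, sources, sinks):
    """Enumerate cycle-free paths from any source to any sink (breadth-first)."""
    paths = []
    queue = deque([s] for s in sorted(sources))
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in sinks:
            paths.append(path)
            continue
        for nxt in flow.get(node, []):
            if nxt not in path:  # skip nodes already on this path
                queue.append(path + [nxt])
    return paths

def contextual_filter(paths, is_false_positive):
    """Keep only the paths that the contextual check does not reject."""
    return [p for p in paths if not is_false_positive(p)]

def stub_llm_check(path):
    # Stand-in for the LLM's contextual judgement: a path that passes through
    # the hypothetical sanitizer PathUtil.normalize counts as a false positive.
    return "PathUtil.normalize" in path

sources, sinks = parse_specs(LLM_REPLY)
candidates = taint_paths(FLOW, sources, sinks)
reported = contextual_filter(candidates, stub_llm_check)
```

In this toy run, two candidate paths reach the sink and the stubbed check keeps only the one that bypasses the sanitizer. The flow starting at Config.getDefault is never explored because that API was not labelled as a source, which is why the quality of the inferred specifications matters so much.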

Experimental results

Research questions

  • RQ1: How many known vulnerabilities can IRIS detect in CWE-Bench-Java compared to CodeQL?
  • RQ2: How effective is the contextual analysis in reducing false positives without sacrificing true positives?
  • RQ3: How accurately can LLMs infer source/sink taint specifications for external/internal APIs for each CWE?

Key findings

  • IRIS detects 69 vulnerabilities on CWE-Bench-Java using GPT-4, which is 42 more than CodeQL (27).
  • GPT-4 generally yields the best performance among the tested LLMs, but smaller specialized models such as DeepSeekCoder 8B also perform strongly (67 detections).
  • Contextual analysis reduces the number of reported paths dramatically (up to 81% fewer paths with GPT-4) while preserving true positives.
  • On average, GPT-4 and DeepSeekCoder label only around 4% of candidate APIs as sources or sinks, with GPT-4's inferred specifications achieving higher precision (over 70%) in manual checks.
  • OS Command Injection (CWE-78) remains particularly challenging for many LLMs due to complex gadget-chain patterns, highlighting static-analysis limitations.
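
The headline numbers in these findings reduce to simple ratios, sketched below. The detection counts and benchmark size come from this review; the path counts passed to path_reduction are hypothetical and chosen only to show what an 81% reduction looks like, and the function names are illustrative.

```python
# Counts reported in this review.
BENCHMARK_SIZE = 120       # vulnerabilities in CWE-Bench-Java
CODEQL_DETECTED = 27       # found by CodeQL alone
IRIS_GPT4_DETECTED = 69    # found by IRIS with GPT-4 (key findings)

def detection_rate(detected, total):
    """Fraction of benchmark vulnerabilities that were detected."""
    return detected / total

def path_reduction(before, after):
    """Fraction of reported paths removed by contextual filtering."""
    return (before - after) / before

def precision(correct, inferred):
    """Fraction of inferred source/sink labels judged correct."""
    return correct / inferred

# Additional vulnerabilities found by IRIS with GPT-4 over CodeQL.
gain = IRIS_GPT4_DETECTED - CODEQL_DETECTED

iris_rate = detection_rate(IRIS_GPT4_DETECTED, BENCHMARK_SIZE)
codeql_rate = detection_rate(CODEQL_DETECTED, BENCHMARK_SIZE)

# Hypothetical path counts: 1000 paths before filtering, 190 after.
example_reduction = path_reduction(before=1000, after=190)
```

With these counts, the gain is 42 vulnerabilities, and the detection rates are 57.5% for IRIS with GPT-4 against 22.5% for CodeQL.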

This review was created by AI and reviewed by human editors.