QUICK REVIEW

[Paper Review] IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

Ziyang Li, Saikat Dutta | arXiv (Cornell University) | May 27, 2024

Software Reliability and Analysis Research | Cited by 14

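One-Sentence Summary

IRIS combines LLMs with static taint analysis to detect vulnerabilities across whole Java repositories, using an LLM to infer CWE-specific taint specifications that augment CodeQL. GPT-4 gives the best results, detecting 69 vulnerabilities (versus 27 for CodeQL) and cutting false positives by up to about 80%.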

ABSTRACT

Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice due to their reliance on human labeled specifications. Large language models (or LLMs) have shown impressive code generation capabilities but they cannot do complex reasoning over code to detect such vulnerabilities especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, alleviating needs for human specifications and inspection. For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnerabilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL's average false discovery rate by 5% points. Furthermore, IRIS identifies 4 previously unknown vulnerabilities which cannot be found by existing tools. IRIS is available publicly at https://github.com/iris-sast/iris.

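Research Motivation and Goals

Motivation: vulnerability detection needs to scale to whole repositories rather than stop at method-level analysis.
Propose a neuro-symbolic pipeline that combines LLM-driven taint specification inference with static taint analysis (CodeQL).
Curate and release CWE-Bench-Java, a dataset of real-world Java vulnerabilities for evaluating whole-project reasoning.
Show that on CWE-Bench-Java IRIS detects more vulnerabilities than CodeQL and reduces false positives through context-based LLM filtering.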

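Proposed Method

Use static analysis (CodeQL) to build a data-flow graph of the Java project and extract candidate APIs.
Prompt an LLM to infer CWE-specific taint sources and sinks among external and internal APIs, returned as JSON-formatted specifications.
Convert the LLM-inferred specifications into CodeQL taint-analysis queries that detect unsanitized data flows.
Run CodeQL with the CWE-specific queries to obtain candidate vulnerable paths, then filter false positives with context-based LLM analysis.
Evaluate on CWE-Bench-Java with multiple LLMs (GPT-4, GPT-3.5, Llama variants, DeepSeekCoder, Mistral, Gemma).
Report the results and analyze the precision of the inferred specifications and the effectiveness of contextual filtering.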
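
The specification-inference and path-detection steps above can be illustrated with a small sketch. It is not the IRIS implementation and does not call CodeQL: the JSON field names, the example APIs, the toy data-flow graph, and the `Validator.escape` sanitizer are all assumptions made for illustration, and a plain breadth-first search stands in for the real taint analysis.

```python
import json
from collections import deque

# Hypothetical LLM reply: one JSON object per candidate API.
# The field names ("class", "method", "type") are assumptions, not the paper's schema.
LLM_REPLY = """
[
  {"class": "HttpServletRequest", "method": "getParameter", "type": "source"},
  {"class": "Runtime", "method": "exec", "type": "sink"},
  {"class": "String", "method": "trim", "type": "none"}
]
"""


def parse_specs(reply):
    """Split the LLM's JSON labels into sets of source and sink API names."""
    sources, sinks = set(), set()
    for item in json.loads(reply):
        name = f'{item["class"]}.{item["method"]}'
        if item["type"] == "source":
            sources.add(name)
        elif item["type"] == "sink":
            sinks.add(name)
    return sources, sinks


def find_taint_paths(graph, sources, sinks, sanitizers=frozenset()):
    """Return every source-to-sink path that does not pass through a sanitizer."""
    paths = []
    for src in sorted(sources):
        queue = deque([[src]])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node in sinks:
                paths.append(path)
                continue
            for nxt in graph.get(node, []):
                # Skip sanitized flows and avoid revisiting nodes on this path.
                if nxt in sanitizers or nxt in path:
                    continue
                queue.append(path + [nxt])
    return paths


# Toy data-flow graph (an assumption): edges point from a value's producer to its consumer.
GRAPH = {
    "HttpServletRequest.getParameter": ["String.trim", "Validator.escape"],
    "String.trim": ["Runtime.exec"],
    "Validator.escape": ["Runtime.exec"],
}

sources, sinks = parse_specs(LLM_REPLY)
paths = find_taint_paths(GRAPH, sources, sinks, sanitizers={"Validator.escape"})
for p in paths:
    print(" -> ".join(p))
```

In this toy graph, only the flow through `String.trim` is reported, because the flow through the hypothetical sanitizer is cut off. In IRIS the corresponding candidate paths come from CodeQL and are then passed to the LLM for contextual filtering.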

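Experimental Results

Research Questions

RQ1: Compared with CodeQL, how many known vulnerabilities can IRIS detect on CWE-Bench-Java?
RQ2: How effectively does contextual analysis reduce false positives without sacrificing true positives?
RQ3: For each CWE, how accurately can LLMs infer taint source/sink specifications for external and internal APIs?

Key Findings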

IRIS with GPT-4 detects 69 vulnerabilities on CWE-Bench-Java, 42 more than CodeQL's 27 (the abstract above reports 55, or +28, for the same comparison).
GPT-4 is generally the strongest of the tested LLMs, though smaller specialized models such as DeepSeekCoder 7B also perform well (for example, 67 detections).
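Contextual analysis substantially reduces the number of reported paths (by up to 81% with GPT-4) while preserving true positives.
On average, the source/sink specifications inferred by GPT-4 and DeepSeekCoder cover about 4% of the candidate APIs, and GPT-4's specifications reach higher precision under manual inspection (above 70%).
OS command injection (CWE-78) remains challenging for many LLMs because of its complex gadget-chain patterns, which highlights the limits of static analysis.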
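
The quantities in these findings are simple ratios, and a short sketch makes the arithmetic explicit. The detection counts (69 and 27) are taken from the findings above; the path counts and alarm counts are hypothetical values chosen only to exercise the formulas, and the false discovery rate uses the standard definition FP / (FP + TP).

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = FP / (FP + TP): the share of reported alarms that are not real."""
    reported = true_positives + false_positives
    if reported == 0:
        return 0.0
    return false_positives / reported


def reduction_percent(before, after):
    """Percentage by which a count shrinks when going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Detection counts quoted in the findings.
CODEQL_DETECTED = 27
IRIS_GPT4_DETECTED = 69
detection_gain = IRIS_GPT4_DETECTED - CODEQL_DETECTED

# Hypothetical numbers, for illustration only.
paths_before_filter = 1000
paths_after_filter = 190
path_reduction = reduction_percent(paths_before_filter, paths_after_filter)
example_fdr = false_discovery_rate(true_positives=10, false_positives=30)

print(f"extra detections over CodeQL: {detection_gain}")
print(f"path reduction: {path_reduction:.0f}%")
print(f"example FDR: {example_fdr:.2f}")
```

A lower false discovery rate means a larger share of the reported paths are real vulnerabilities, which is the effect contextual filtering is meant to have.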

This review was generated by AI and checked by a human editor.