QUICK REVIEW

[论文解读] Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities

Avishree Khare, Saikat Dutta|arXiv (Cornell University)|Nov 16, 2023

Software Engineering Research被引用 12

一句话总结

本文在 Java 与 C/C++ 的五个漏洞数据集上对预训练的 LLMs（GPT-4、GPT-3.5、CodeLlama）进行基准测试，开发用于检测漏洞并提供解释的 prompting 策略，并与静态分析和深度学习工具进行比较，分析对抗性代码的鲁棒性以及微调对性能的影响。

ABSTRACT

While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore if LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a higher number of datasets, languages and LLMs, and qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities only requiring intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.

研究动机与目标

评估预训练的 LLMs 能否在多数据集和多语言环境中检测代码安全漏洞。
设计 prompting 策略，使 LLMs 在漏洞检测中能够产生推理和解释。
将 LLM 性能与静态分析（CodeQL）和深度学习（LineVul）方法进行比较。
评估对数据泄漏和对抗性代码攻击的鲁棒性，并探究微调对性能的影响。
为在漏洞检测中利用 LLMs 提供见解和建议。

提出的方法

在五个数据集（OWASP、Juliet Java、Juliet C/C++、CVEFixes Java、CVEFixes C/C++）上评估 GPT-4、GPT-3.5、CodeLlama-7B 与 CodeLlama-13B。
开发四种 prompting 策略：Basic、CWE-specific、基于数据流分析、以及带自我反思的数据流分析。
将 LLM 结果与 CodeQL（静态分析）和 LineVul（深度学习）进行比较。
通过语义保持的对抗性攻击评估数据泄漏并衡量性能影响。
考察对较小模型的微调及其对合成数据与真实世界数据集性能的影响。

实验结果

研究问题

RQ1预训练的 LLMs 是否能够在多样化的数据集和语言环境中检测代码安全漏洞？
RQ2 prompting 策略如何影响 LLM 漏洞检测的性能与可解释性？
RQ3在标准漏洞基准上，LLMs 与 CodeQL 与 LineVul 的比较如何？
RQ4LLMs 对对抗性代码修改和数据泄漏是否具有鲁棒性？
RQ5对较小的 LLMs 进行微调是否能够提升性能，是否能在不同数据集之间保持泛化？

主要发现

带有数据流分析基础提示 + 自我反思的 GPT-4 在合成数据集上的 F1 分数分别为 0.79（OWASP）、0.86（Juliet Java）和 0.89（Juliet C/C++）。
在同一提示下，GPT-4 在真实世界数据集上达到最高 0.48（CVEFixes Java）和 0.62（CVEFixes C/C++）。
CodeLlama 模型在合成数据集上的 F1 分数较低（例如 0.69–0.77）；在 CVEFixes C/C++ 上最高可达 0.65。
自我反思在真实世界数据集上倾向于推动 GPT-4 预测“非漏洞”，从而降低性能；CodeLlama 在这些情况下显示出更好的鲁棒性。
LineVul 在使用大数据训练时在 Juliet C/C++ 上实现了 1.0 的 F1，但缺乏可解释的 CWE 级别解释；GPT-4 提供了更易解释的推理。
对抗性代码攻击导致轻微退化（对某些攻击平均下降至 12.67% 以上）；微调较小模型可提升合成数据结果，但对真实世界数据的增益有限。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。