QUICK REVIEW

[Paper Review] To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

Benjamin Steenhoek, Md. Mahbubur Rahman|arXiv (Cornell University)|Mar 25, 2024

Network Security and Intrusion Detection | Cited by 17

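One-Sentence Summary

This paper evaluates 11 state-of-the-art language models on vulnerability detection and finds that they reason poorly (0.5–0.63 Balanced Accuracy), frequently err when localizing and explaining vulnerabilities, and align only weakly with human experts.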

ABSTRACT

In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses shows that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. We explored prominent models and training settings to understand their effects on vulnerability detection performance -- including better prompts, larger models, more pre-training data, and fine-tuning -- but none led to significant improvements. This raises the question of whether simply scaling training data and model size will allow us to "solve" complex code reasoning tasks like vulnerability detection, or if a fundamental shift in modeling and training techniques is required. We also explored adding domain knowledge to prompts; although it helped certain models understand some code semantics, vulnerability detection requires multi-step reasoning, and these models still failed in steps, such as reasoning about variable relations. Our results suggest that new models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially on the current pretraining mixtures, in which execution data is scarce. Success on vulnerability detection as a code reasoning task can benefit many areas of software engineering such as debugging, test input generation, and program repair. Our code and data are available at https://doi.org/10.6084/m9.figshare.27368025.

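Research Motivation and Goals

Evaluate how well state-of-the-art LLMs detect software vulnerabilities in code.
Study how prompting techniques (basic prompts, in-context prompts, and chain-of-thought reasoning) affect vulnerability detection.
Characterize the types of errors LLMs make when explaining vulnerabilities.
Compare LLM performance on vulnerability localization with human performance on a standard benchmark.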

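Proposed Method

Survey 11 LLMs aimed at code generation and evaluate them with five prompt templates, which include three novel techniques.
Use the SVEN dataset (100 pairs of vulnerable and fixed functions) to build a binary classification task for vulnerability detection.
Systematically search for high-performing prompts and configurations, including Basic, IC-Random, IC-Embedding, CoT-CVE, and CoT-SA prompts.
Analyze 287 LLM responses, sorting errors into Code Understanding, Hallucination/Memorization/Repetition, Logic, and Common Knowledge.
Evaluate fault localization on the DbgBench benchmark and compare against human performance.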
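
To make the paired-dataset setup concrete, here is a minimal sketch of how vulnerable/fixed function pairs could be flattened into a binary classification task. The `FunctionPair` class, the `to_binary_examples` helper, and the sample C snippet are all hypothetical illustrations, not the paper's code or the SVEN loading format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionPair:
    """Hypothetical container for one pair of functions."""
    vulnerable: str  # source of the function before the fix
    fixed: str       # source of the same function after the fix


def to_binary_examples(pairs):
    """Flatten each pair into two labeled examples.

    Returns (pair_id, source, label) tuples, where label 1 means
    vulnerable and label 0 means fixed. Keeping pair_id lets a later
    analysis check whether a model separated the two versions.
    """
    examples = []
    for pair_id, pair in enumerate(pairs):
        examples.append((pair_id, pair.vulnerable, 1))
        examples.append((pair_id, pair.fixed, 0))
    return examples


# Made-up pair for illustration only (not taken from the dataset).
pairs = [
    FunctionPair(
        vulnerable="void f(char *s) { char b[8]; strcpy(b, s); }",
        fixed="void f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; }",
    ),
]
examples = to_binary_examples(pairs)
```

Because every pair contributes one positive and one negative example, a task built this way is balanced by construction, so a model that always gives the same answer gains nothing from class imbalance.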

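Experimental Results

Research Questions

RQ1: Which prompt designs are the most and least successful for LLM-based vulnerability detection?
RQ2: How well do state-of-the-art LLMs perform on vulnerability detection?
RQ3: What kinds of errors do LLMs make when explaining vulnerabilities?
RQ4: How do LLMs compare with human developers at vulnerability localization?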

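Key Findings

Balanced Accuracy ranges from 0.5 to 0.63 across models, close to random guessing.
Models could not distinguish 76% of the buggy-vs-fixed pairs.
LLMs correctly localized only 6 of the 27 DbgBench bugs; GPT-3 performed best, with 4/27.
57% of responses contained errors in code understanding, reasoning/logic, or common knowledge; bounds and null checks were frequently misidentified.
When models had to report a vulnerability's location, type, and root cause, explanation accuracy dropped substantially (by 18–100%).
On the DbgBench localization task, humans localized faults more reliably than the evaluated LLMs.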
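
For readers unfamiliar with the metric, Balanced Accuracy is the mean of the recall on the vulnerable class and the recall on the fixed class, which is why 0.5 corresponds to guessing. The sketch below computes it, along with one possible reading of "cannot distinguish a pair": the model assigns the same label to both versions. That reading, the function names, and the toy predictions are assumptions for illustration, not the paper's evaluation code or results.

```python
def balanced_accuracy(labels, predictions):
    """Mean of recall on class 1 (vulnerable) and recall on class 0 (fixed)."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return 0.5 * (true_pos / positives + true_neg / negatives)


def undistinguished_fraction(pair_predictions):
    """Fraction of pairs whose two versions received the same label.

    Each item is (prediction_for_vulnerable, prediction_for_fixed).
    Treating "same label" as "not distinguished" is an assumption here.
    """
    same = sum(1 for vuln_pred, fixed_pred in pair_predictions
               if vuln_pred == fixed_pred)
    return same / len(pair_predictions)


# Toy data: a model that answers "vulnerable" for every function.
labels = [1, 0, 1, 0]
always_vulnerable = [1, 1, 1, 1]
score = balanced_accuracy(labels, always_vulnerable)

# The same model viewed pair by pair: both versions get label 1.
pair_view = [(1, 1), (1, 1)]
same_label_rate = undistinguished_fraction(pair_view)
```

In this toy case the score is 0.5 and every pair is undistinguished, which shows how a model can look active on individual functions while failing to separate any vulnerable function from its fix.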


This review was generated by AI and reviewed by a human editor.