QUICK REVIEW

[Paper Review] To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

Benjamin Steenhoek, Md. Mahbubur Rahman | arXiv (Cornell University) | Mar 25, 2024

Network Security and Intrusion Detection | Cited by 17

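One-Line Summary

This paper evaluates 11 state-of-the-art LLMs on vulnerability detection and shows that their reasoning performance is poor (0.5–0.63 Balanced Accuracy), that they frequently make errors when identifying and explaining vulnerabilities, and that their agreement with human experts is limited.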

ABSTRACT

In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses shows that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. We explored prominent models and training settings to understand their effects on vulnerability detection performance -- including better prompts, larger models, more pre-training data, and fine-tuning -- but none led to significant improvements. This raises the question of whether simply scaling training data and model size will allow us to "solve" complex code reasoning tasks like vulnerability detection, or if a fundamental shift in modeling and training techniques is required. We also explored adding domain knowledge to prompts; although it helped certain models understand some code semantics, vulnerability detection requires multi-step reasoning, and these models still failed in steps, such as reasoning about variable relations. Our results suggest that new models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially on the current pretraining mixtures, in which execution data is scarce. Success on vulnerability detection as a code reasoning task can benefit many areas of software engineering such as debugging, test input generation, and program repair. Our code and data are available at https://doi.org/10.6084/m9.figshare.27368025.

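Motivation and Objectives

Evaluate the ability of state-of-the-art LLMs to detect software vulnerabilities in code.
Investigate how prompting techniques (basic, in-context, chain-of-thought) affect vulnerability detection.
Characterize the types of errors LLMs make when explaining vulnerabilities.
Compare LLM vulnerability localization with human performance on a standard benchmark.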

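Proposed Method

Survey 11 code-generation-oriented LLMs and evaluate them with five prompt templates, three of which are novel techniques.
Build a binary vulnerability-detection classification task from the SVEN dataset (100 function pairs of buggy and fixed code).
Systematically explore high-performing prompts and settings, including the Basic, IC-Random, IC-Embedding, CoT-CVE, and CoT-SA prompts.
Analyze 287 LLM responses and classify the errors into Code Understanding, Hallucination/Memorization/Repetition, Logic, and Common Knowledge.
Evaluate fault localization on the DbgBench benchmark and compare it with human performance.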
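
The task setup above can be made concrete with a small sketch. The Python below is an illustration only, not the authors' code: it assumes each SVEN pair is a (vulnerable, fixed) tuple of source strings, labels the vulnerable version 1 and the fixed version 0, and computes Balanced Accuracy as the mean of the true-positive rate and the true-negative rate. The names build_samples and balanced_accuracy are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_samples(pairs: Sequence[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Flatten (vulnerable, fixed) pairs into labeled samples.

    Label 1 marks the vulnerable version, label 0 the fixed version,
    so N pairs yield 2N samples with perfectly balanced classes.
    """
    samples = []
    for vulnerable_code, fixed_code in pairs:
        samples.append((vulnerable_code, 1))
        samples.append((fixed_code, 0))
    return samples


def balanced_accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)
```

Under this definition, a model that answers "vulnerable" for every function scores exactly 0.5, which is why scores in the 0.5–0.63 range are read as close to random guessing.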

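Experimental Results

Research Questions

RQ1: Which prompt designs are the most and least successful for LLM-based vulnerability detection?
RQ2: How well do state-of-the-art LLMs perform on vulnerability detection?
RQ3: What kinds of errors do LLMs make when explaining vulnerabilities?
RQ4: How do LLMs compare with human developers at localizing vulnerabilities?

Key Findings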

Balanced Accuracy of 0.5–0.63 across models, close to random guessing.
76% of buggy-vs-fixed pairs could not be distinguished by the models.
LLMs correctly located only 6 of 27 DbgBench bugs; GPT-3 performed best with 4/27.
57% of responses contained errors in code understanding, logic, or common knowledge, with bounds/null checks frequently misidentified.
Explanations showed substantial drops in accuracy when reporting bug location, type, and root cause (18–100% drop).
On DbgBench, human developers localized faults more reliably than the evaluated LLMs.
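
The pair-level finding can be illustrated with a short sketch. The Python below is a hedged example, not the paper's evaluation code: it assumes a buggy-vs-fixed pair counts as distinguished only when the model gives different answers for the two versions, and the category names and the functions classify_pair and pair_outcome_rates are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

OUTCOMES = ("cannot_distinguish", "both_correct", "both_wrong")


def classify_pair(pred_on_vulnerable: int, pred_on_fixed: int) -> str:
    """Categorize one pair from the model's two predictions (1 = vulnerable).

    Assumed definition: identical predictions mean the model could not
    tell the two versions apart, whichever label it chose.
    """
    if pred_on_vulnerable == pred_on_fixed:
        return "cannot_distinguish"
    if pred_on_vulnerable == 1 and pred_on_fixed == 0:
        return "both_correct"
    return "both_wrong"


def pair_outcome_rates(pair_predictions: Sequence[Tuple[int, int]]) -> Dict[str, float]:
    """Fraction of pairs falling into each outcome category."""
    if not pair_predictions:
        raise ValueError("at least one pair is required")
    counts = Counter(classify_pair(v, f) for v, f in pair_predictions)
    total = len(pair_predictions)
    return {name: counts[name] / total for name in OUTCOMES}
```

Reporting the "cannot_distinguish" rate next to Balanced Accuracy shows whether a model reacts at all to the small textual change that separates a vulnerable function from its fix.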


This review was generated by AI and checked by a human editor.