QUICK REVIEW

[Paper Review] Vulnerability Detection with Code Language Models: How Far Are We?

Yangruibo Ding, Yanjun Fu|arXiv (Cornell University)|Mar 27, 2024

Network Security and Intrusion Detection | Cited by 13

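One-Sentence Summary

This paper audits code language models for vulnerability detection, exposes data-quality and evaluation flaws in existing benchmarks, introduces PrimeVul with rigorous labeling and de-duplication, and shows that current models perform poorly under realistic settings.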

ABSTRACT

In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.

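Motivation and Objectives

Analyze the shortcomings of existing vulnerability detection datasets and benchmarks.
Propose PrimeVul, a dataset with high-accuracy labeling and de-duplication that reduces data leakage.
Introduce realistic evaluation guidelines, including VD-S and pair-wise function evaluation.
Empirically evaluate a range of open-source code LMs on PrimeVul to establish realistic performance baselines.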

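Proposed Method

Critically analyze data collection, label accuracy, and duplication in existing vulnerability detection (VD) benchmarks.
Build PrimeVul using two labeling techniques (PrimeVul-OneFunc and PrimeVul-NVDCheck) and thorough de-duplication.
Apply a chronological train/validation/test split to reduce leakage, and introduce VD-S and pair-wise function evaluation.
Fine-tune several open-source code LMs and evaluate them on PrimeVul under realistic settings.
Report accuracy, F1, VD-S, and pair-wise results to assess practical usefulness.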
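The de-duplication and chronological split steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes that duplicates are detected by hashing whitespace-stripped function text and that the split is 80/10/10 by commit date. The field names `code` and `commit_date` and both function names are hypothetical.

```python
import hashlib


def normalize(code: str) -> str:
    # Assumption: removing all whitespace makes formatting-only copies identical.
    return "".join(code.split())


def dedupe(samples):
    """Keep the first sample for each distinct normalized function body."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.md5(normalize(sample["code"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def chronological_split(samples, train_frac=0.8, valid_frac=0.1):
    """Oldest commits go to training, the newest to test (fractions are assumed)."""
    ordered = sorted(samples, key=lambda s: s["commit_date"])
    n_train = int(len(ordered) * train_frac)
    n_valid = int(len(ordered) * valid_frac)
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]
    return train, valid, test
```

Sorting by date before slicing means every test function comes from a commit no older than any training commit, which is the property a random split cannot guarantee.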

Figure 1: The template for the two-shot prompt.

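Experimental Results

Research Questions

RQ1: How do open-source code LMs perform on PrimeVul?
RQ2: Can advanced training techniques improve vulnerability detection performance?
RQ3: Do larger language models (LLMs) improve vulnerability detection performance?

Key Findings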

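Abbreviations: CT5 = CodeT5, CB = CodeBERT, UC = UnixCoder, SC2 = StarCoder2, CG2.5 = CodeGen2.5; BV = BigVul, PV = PrimeVul.

Model	Train	Test	Acc	F1	VD-S	P-C	P-V	P-B	P-R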
CT5	BV	BV	95.67	64.93	77.30	24.98	50.90	22.79	1.33
PV	PV	PV	97.00	5.82	95.97	0.18	3.01	96.10	0.71
PV	PV	PV	96.67	19.7	89.93	1.06	12.94	84.75	1.24
CB	BV	BV	95.57	62.88	81.77	22.60	48.34	27.83	1.23
PV	BV	BV	97.04	4.49	95.54	0.35	1.95	96.99	0.71
PV	PV	PV	96.87	20.86	88.78	1.77	11.35	86.17	0.71
UC	BV	BV	96.46	65.46	62.30	39.60	23.74	33.24	3.42
PV	BV	BV	97.27	1.94	95.11	0.35	0.35	98.76	0.53
PV	PV	PV	96.86	21.43	89.21	1.60	12.06	85.11	1.24
SC2	BV	BV	96.20	68.26	69.14	35.23	41.98	20.61	2.18
PV	BV	BV	97.09	3.09	96.83	0.89	0.89	97.70	0.53
PV	PV	PV	97.02	18.05	89.64	2.30	8.16	84.22	1.95
CG2.5	BV	BV	96.57	67.30	61.73	40.84	26.02	29.63	3.51
PV	BV	BV	97.23	1.91	95.68	1.24	0.00	98.76	0.00
PV	PV	PV	96.65	19.61	91.51	3.01	10.82	84.22	1.95
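The P-C, P-V, P-B, and P-R columns come from the pair-wise evaluation. The review does not define them, so the sketch below assumes the following reading: each test pair holds a vulnerable function and its patched version, and the four outcomes are pair-wise correct (P-C), both flagged vulnerable (P-V), both flagged benign (P-B), and reversed (P-R). The function name and input layout are hypothetical.

```python
from collections import Counter


def pairwise_breakdown(pairs):
    """Return the percentage of pairs in each outcome.

    pairs: iterable of (pred_on_vulnerable, pred_on_patched) booleans,
    where True means the model flagged that function as vulnerable.
    """
    counts = Counter()
    for pred_vuln, pred_patched in pairs:
        if pred_vuln and not pred_patched:
            counts["P-C"] += 1   # both functions classified correctly
        elif pred_vuln and pred_patched:
            counts["P-V"] += 1   # both flagged as vulnerable
        elif not pred_vuln and not pred_patched:
            counts["P-B"] += 1   # both flagged as benign
        else:
            counts["P-R"] += 1   # labels swapped
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pairs given")
    return {key: 100.0 * counts[key] / total for key in ("P-C", "P-V", "P-B", "P-R")}
```

Under this reading, a model that labels nearly everything benign can reach high accuracy on an imbalanced test set while almost all of its pairs land in P-B.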

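Existing benchmarks overestimate the vulnerability detection performance of code LMs in realistic settings.
PrimeVul contains 6,968 vulnerable and 228,800 benign functions spanning 140 CWEs, with label accuracy comparable to human-verified benchmarks.
Code LMs show a large performance gap when evaluated on PrimeVul (for example, a model with 68.26% F1 on BigVul drops to 3.09% F1 on PrimeVul).
Advanced training techniques yield only marginal gains, and GPT-3.5 and GPT-4 do not deliver reliable improvements; in the most stringent settings their results approach random guessing.
The new metric (VD-S) and pair-wise function evaluation expose weaknesses that conventional accuracy and F1 do not capture.
Chronological data splitting mitigates leakage and better reflects the constraints of real-world model deployment.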
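To make the VD-S metric concrete, here is a minimal sketch. It assumes VD-S is the false negative rate achieved when the decision threshold is tuned so that the false positive rate stays at or below a small tolerance; the 0.5% default, the function name, and the input layout are assumptions, since the review itself does not give the definition.

```python
def vd_score(labels, scores, max_fpr=0.005):
    """Lowest false negative rate over thresholds with FPR <= max_fpr.

    labels: 1 for vulnerable, 0 for benign.
    scores: model scores, higher meaning more likely vulnerable.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one vulnerable and one benign sample")

    # A threshold above every score flags nothing: FPR = 0 and FNR = 1.
    best_fnr = 1.0
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in negatives) / len(negatives)
        if fpr <= max_fpr:
            fnr = sum(s < threshold for s in positives) / len(positives)
            best_fnr = min(best_fnr, fnr)
    return best_fnr
```

Lower is better under this definition. A value near 1.0 (or near 100 when reported as a percentage) means the model misses almost every vulnerability once false alarms are held to the tolerated rate.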

Figure 2: The template for the chain-of-thought prompt


This review was generated by AI and checked by a human editor.