QUICK REVIEW

[Paper Review] MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution

Zihan Wu, Jie Xu|arXiv (Cornell University)|Jan 26, 2026

Adversarial Robustness in Machine Learning | Citations: 0

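One-line summary

MulVul is a two-stage Router-Detector multi-agent system that combines retrieval-grounded reasoning with cross-model prompt evolution to detect multiple code vulnerability types, achieving state-of-the-art Macro-F1 on PrimeVul.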

ABSTRACT

Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose MulVul, a retrieval-augmented multi-agent framework designed for precise and broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a Router agent first predicts the top-$k$ coarse categories and then forwards the input to specialized Detector agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design Cross-Model Prompt Evolution, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79% Macro-F1, outperforming the best baseline by 41.5%. Ablation studies validate cross-model prompt evolution, which boosts performance by 51.6% over manual prompts by effectively handling diverse vulnerability patterns.

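Motivation and objectives

Address the heterogeneity and scalability challenges of automated vulnerability detection across more than a hundred CWE types.
Propose a coarse-to-fine Router-Detector architecture that routes each input to specialized Detectors.
Automate prompt optimization through cross-model evolution to avoid self-correction bias and improve robustness.
Ground reasoning in a SCALE-based knowledge base to mitigate hallucinations.
Demonstrate state-of-the-art performance on PrimeVul across 130 CWE types, including few-shot regimes.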

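Proposed method

Adopts a coarse-to-fine Router-Detector framework: the Router predicts the top coarse categories and selects the corresponding Detectors, which handle the fine-grained vulnerability types.
Uses SCALE-based structured representations to ground code semantics and guide retrieval.
Offline preparation builds the SCALE-based knowledge base K and optimizes the Router and Detector prompts with Cross-Model Prompt Evolution.
Cross-Model Prompt Evolution decouples the generator (Claude) from the executor (GPT-4o), so prompts are evolved iteratively while being evaluated by an independent LLM.
Detectors perform contrastive retrieval over in-category examples, clean examples, and hard negatives from other categories to improve precision.
During online detection, the Router retrieves cross-category evidence, and the Detectors use category-specific retrieved evidence to identify the exact vulnerability type; the results are then aggregated.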
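
The Cross-Model Prompt Evolution step listed above can be sketched as a small search loop. This is a minimal illustration, not the authors' implementation: `generator` and `executor` are hypothetical callables standing in for two distinct LLMs (the review names Claude and GPT-4o), and the number of rounds, the number of candidates per round, and the accuracy-based fitness are assumptions chosen for brevity.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   generator(prompt, feedback) -> a revised candidate prompt
#   executor(prompt, code)      -> a predicted label for `code`
Generator = Callable[[str, str], str]
Executor = Callable[[str, str], str]
Example = Tuple[str, str]  # (code snippet, gold label)


def score(prompt: str, executor: Executor, dev_set: List[Example]) -> float:
    """Fraction of dev examples the executor labels correctly with `prompt`."""
    if not dev_set:
        return 0.0
    hits = sum(executor(prompt, code) == gold for code, gold in dev_set)
    return hits / len(dev_set)


def evolve_prompt(seed: str, generator: Generator, executor: Executor,
                  dev_set: List[Example], rounds: int = 3,
                  children: int = 2) -> Tuple[str, float]:
    """Greedy evolution: one model proposes prompts, a different model scores them."""
    best_prompt, best_score = seed, score(seed, executor, dev_set)
    for _ in range(rounds):
        feedback = f"current dev accuracy: {best_score:.2f}"
        candidates = [generator(best_prompt, feedback) for _ in range(children)]
        for cand in candidates:
            cand_score = score(cand, executor, dev_set)
            if cand_score > best_score:  # keep only strict improvements
                best_prompt, best_score = cand, cand_score
    return best_prompt, best_score


# Toy stand-ins so the sketch runs without any LLM API.
def toy_generator(prompt: str, feedback: str) -> str:
    return prompt + " Check buffer bounds."


def toy_executor(prompt: str, code: str) -> str:
    return "CWE-119" if "bounds" in prompt and "strcpy" in code else "safe"


dev = [("strcpy(dst, src);", "CWE-119"), ("return 0;", "safe")]
best, acc = evolve_prompt("Classify the code.", toy_generator, toy_executor, dev)
```

The point of the split is that the model proposing a prompt never grades its own proposal; in this toy run the seed prompt scores 0.5 and the first evolved candidate is kept because the separate executor scores it higher.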

Figure 1: Comparison between MulVul and existing LLM-based vulnerability detection methods. (a) Existing methods rely on fixed prompts and lack external grounding. (b) MulVul adopts a coarse-to-fine, retrieval-augmented multi-agent framework for multi-type vulnerability detection.

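Experimental results

Research questions

RQ1: How does MulVul compare with existing LLM-based vulnerability detection methods at the coarse category level and at the fine-grained type level?
RQ2: How does the Router's top-k parameter affect the precision-recall trade-off and overall Macro-F1?
RQ3: How much do retrieval grounding, the multi-agent architecture, and prompt evolution each contribute to performance?
RQ4: How does MulVul perform in few-shot CWE scenarios and in the full-data regime?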

Key findings

Method	Macro-Precision	Macro-Recall	Macro-F1
GPT-4o	3.86	—	—
LLM × CPG	27.44	62.81	38.20
LLMVulExp	41.50	—	—
VISION	26.80	—	—
MulVul (Ours)	50.41	58.45	—

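MulVul achieves 50.41% Macro-F1 at the category level.
At the type level it achieves 34.79% Macro-F1, 10.21 points above the best baseline, which corresponds to the 41.5% relative improvement reported in the abstract.
Macro-Recall improves as k grows while Macro-Precision falls, and Macro-F1 peaks at k=3.
Ablations show that retrieval grounding is critical: removing retrieval drops Macro-F1 from 34.56% to 21.80%.
Cross-model prompt evolution yields a substantial gain: manual prompts score 11.76 points lower in F1 than evolved prompts.
MulVul shows strong few-shot performance, with about 48% F1 on CWEs that have fewer than 100 samples and about 63% F1 at around 300 samples.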
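
Because the findings above mix absolute gains (points) with relative gains (percent), a short worked example may help. The sketch below computes macro-averaged metrics from per-class counts and then redoes the arithmetic linking the reported point differences to the relative improvements quoted in the abstract. The per-class counts are made-up toy values, the baseline scores are inferred here by subtraction rather than quoted from the paper, and the 11.76 figure is read as percentage points.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Unweighted mean of per-class metrics, so rare classes count as much as common ones."""
    rows = [prf(*c) for c in counts.values()]
    n = len(rows)
    macro_p, macro_r, macro_f1 = (sum(col) / n for col in zip(*rows))
    return macro_p, macro_r, macro_f1


# Toy counts (tp, fp, fn) for two hypothetical classes.
toy_counts = {"CWE-A": (8, 2, 2), "CWE-B": (1, 3, 1)}
macro_p, macro_r, macro_f1 = macro_average(toy_counts)

# Macro-F1 is the mean of per-class F1 scores, which generally differs from
# the harmonic mean of Macro-Precision and Macro-Recall.
harmonic_of_macros = 2 * macro_p * macro_r / (macro_p + macro_r)

# Type-level result: 34.79 Macro-F1, 10.21 points above the best baseline.
baseline_f1 = 34.79 - 10.21                  # inferred baseline score
relative_gain = 10.21 / baseline_f1 * 100    # relative improvement in percent

# Ablation: 34.56 with evolved prompts, 11.76 points lower with manual prompts.
manual_f1 = 34.56 - 11.76                    # inferred manual-prompt score
evolution_gain = 11.76 / manual_f1 * 100     # relative improvement in percent

print(f"macro P/R/F1: {macro_p:.3f} {macro_r:.3f} {macro_f1:.3f}")
print(f"relative gain over baseline: {relative_gain:.1f}%")
print(f"relative gain from prompt evolution: {evolution_gain:.1f}%")
```

Under these readings the two relative gains round to 41.5% and 51.6%, matching the abstract.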

Figure 2: Overview of MulVul for vulnerability detection. The router agent first selects top-$k$ candidate vulnerability categories, and category-specific detector agents then perform fine-grained identification with retrieved CWE-specific evidence.
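
The flow in Figure 2 can be sketched as a top-k routing step followed by per-category detection. This is a toy illustration under stated assumptions: the keyword-matching `route` and `detect` functions and the tiny `KNOWLEDGE_BASE` are hypothetical stand-ins for the LLM agents and the retrieval knowledge base, and the category names and CWE assignments are chosen only for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: category -> list of (CWE id, evidence keyword).
KNOWLEDGE_BASE: Dict[str, List[Tuple[str, str]]] = {
    "memory": [("CWE-787", "strcpy"), ("CWE-416", "free")],
    "injection": [("CWE-89", "SELECT"), ("CWE-78", "system(")],
    "numeric": [("CWE-190", "+ len")],
}


def route(code: str, k: int) -> List[str]:
    """Stand-in Router: rank categories by matched evidence keywords, keep the top k."""
    scores = {
        category: sum(keyword in code for _, keyword in entries)
        for category, entries in KNOWLEDGE_BASE.items()
    }
    ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
    return ranked[:k]


def detect(code: str, category: str) -> List[str]:
    """Stand-in Detector: return CWE ids whose category-specific evidence matches."""
    return [cwe for cwe, keyword in KNOWLEDGE_BASE[category] if keyword in code]


def mulvul_like_pipeline(code: str, k: int = 3) -> List[str]:
    """Route to k categories, run one detector per category, aggregate the findings."""
    findings: List[str] = []
    for category in route(code, k):
        findings.extend(detect(code, category))
    return sorted(set(findings))


sample = "strcpy(buf, input); system(buf);"
narrow = mulvul_like_pipeline(sample, k=1)
wide = mulvul_like_pipeline(sample, k=2)
print(narrow, wide)
```

With k=1 only the memory detector runs and the command-injection finding is missed; with k=2 both are reported. This mirrors the trade-off in the findings: a larger k recovers more true positives, but every extra detector is another chance for a false positive.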

This review was generated by AI and checked by a human editor.