QUICK REVIEW

[Paper Review] MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution

Zihan Wu, Jie Xu|arXiv (Cornell University)|Jan 26, 2026
Adversarial Robustness in Machine Learning | Cited by 0
One-Sentence Summary

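MulVul is a two-stage Router-Detector multi-agent system that combines retrieval-grounded reasoning with cross-model prompt evolution to detect many types of code vulnerabilities, reaching state-of-the-art Macro-F1 on PrimeVul.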

ABSTRACT

Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose MulVul, a retrieval-augmented multi-agent framework designed for precise and broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a Router agent first predicts the top-k coarse categories and then forwards the input to specialized Detector agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design Cross-Model Prompt Evolution, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79% Macro-F1, outperforming the best baseline by 41.5%. Ablation studies validate cross-model prompt evolution, which boosts performance by 51.6% over manual prompts by effectively handling diverse vulnerability patterns.

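Research Motivation and Objectives

  • Address the heterogeneity and scalability challenges of automated vulnerability detection across hundreds of CWE types.
  • Propose a coarse-to-fine Router-Detector architecture that routes each input to specialized detectors.
  • Automate prompt optimization through cross-model evolution, avoiding self-correction bias and improving robustness.
  • Mitigate hallucinations with retrieval-grounded reasoning built on SCALE.
  • Demonstrate state-of-the-art performance on 130 CWE types from PrimeVul, including few-shot settings.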

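Proposed Method

  • Adopt a coarse-to-fine Router-Detector framework: the Router predicts the top-k coarse categories and selects the matching Detectors, which then identify the fine-grained vulnerability type.
  • Use SCALE-based structured representations to ground code semantics and guide retrieval.
  • In the offline preparation phase, build the SCALE-based knowledge base K and optimize the Router and Detector prompts with Cross-Model Prompt Evolution.
  • Cross-Model Prompt Evolution decouples the generator (Claude) from the executor (GPT-4o), so prompts evolve iteratively under evaluation by an independent LLM.
  • Detectors perform contrastive retrieval over same-class examples, clean examples, and hard negatives from other classes to improve precision.
  • In the online detection phase, the Router retrieves cross-category evidence, each Detector uses category-specific retrieved evidence to identify the exact vulnerability type, and the results are aggregated.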
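The online routing step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Router` and `Detector` callables, the category names (`memory`, `injection`, `crypto`), and the keyword-based toy agents are all hypothetical stand-ins for LLM calls that would receive retrieved evidence in their prompts. The default `k = 3` mirrors the value at which Macro-F1 is reported to peak.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: a router scores coarse categories; a detector
# returns the fine-grained CWE IDs it finds (empty list means none found).
Router = Callable[[str], Dict[str, float]]
Detector = Callable[[str], List[str]]


def detect(code: str, router: Router, detectors: Dict[str, Detector], k: int = 3) -> List[str]:
    """Route `code` to the top-k coarse categories, then aggregate detector outputs."""
    scores = router(code)
    # Keep the k highest-scoring categories; ties are broken by name for determinism.
    top_k = sorted(scores, key=lambda c: (-scores[c], c))[:k]
    found: List[str] = []
    for category in top_k:
        for cwe in detectors[category](code):
            if cwe not in found:  # aggregate without duplicates
                found.append(cwe)
    return found


# Toy stand-ins for the LLM agents (assumption: real agents would call an LLM).
def toy_router(code: str) -> Dict[str, float]:
    return {
        "memory": 0.9 if "strcpy" in code else 0.1,
        "injection": 0.8 if "system(" in code else 0.2,
        "crypto": 0.05,
    }


toy_detectors: Dict[str, Detector] = {
    "memory": lambda code: ["CWE-120"] if "strcpy" in code else [],
    "injection": lambda code: ["CWE-78"] if "system(" in code else [],
    "crypto": lambda code: [],
}

snippet = "strcpy(buf, input); system(cmd);"
narrow = detect(snippet, toy_router, toy_detectors, k=1)
wide = detect(snippet, toy_router, toy_detectors, k=2)
print(narrow, wide)
```

With `k=1` only the highest-scoring category is consulted, so the second weakness in the snippet is missed; raising `k` widens coverage at the cost of running more detectors, which is the precision-recall trade-off that RQ2 examines.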
Figure 1: Comparison between MulVul and existing LLM-based vulnerability detection methods. (a) Existing methods rely on fixed prompts and lack external grounding. (b) MulVul adopts a coarse-to-fine, retrieval-augmented multi-agent framework for multi-type vulnerability detection.
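In contrast to the fixed prompts of existing methods, Cross-Model Prompt Evolution lets one LLM propose prompts while a different LLM is used to score them. The loop below is a minimal sketch of that decoupling, assuming a simple greedy search and accuracy as the fitness signal; the source does not specify the search strategy or the fitness metric. `generate_variants` and `execute` are hypothetical wrappers around the generator and executor models, replaced here by deterministic toy functions so the example runs offline.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (code snippet, gold label)


def evolve_prompt(
    seed_prompt: str,
    generate_variants: Callable[[str, int], List[str]],  # generator LLM (hypothetical wrapper)
    execute: Callable[[str, str], str],                   # executor LLM (hypothetical wrapper)
    dev_set: Sequence[Example],
    rounds: int = 3,
    variants_per_round: int = 4,
) -> Tuple[str, float]:
    """Greedy prompt search: the generator proposes, a separate executor is scored."""

    def score(prompt: str) -> float:
        # Assumption: fitness is plain accuracy of the executor on a dev set.
        correct = sum(execute(prompt, code) == label for code, label in dev_set)
        return correct / len(dev_set)

    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for candidate in generate_variants(best, variants_per_round):
            candidate_score = score(candidate)
            if candidate_score > best_score:  # keep only strict improvements
                best, best_score = candidate, candidate_score
    return best, best_score


# Deterministic toy models standing in for the two LLMs.
HINTS = ["Think step by step.", "Check buffer bounds.", "Be concise."]


def toy_generator(prompt: str, n: int) -> List[str]:
    return [f"{prompt} {hint}" for hint in HINTS[:n]]


def toy_executor(prompt: str, code: str) -> str:
    flagged = "bounds" in prompt and "strcpy" in code
    return "vulnerable" if flagged else "safe"


dev = [("strcpy(dst, src);", "vulnerable"), ("puts(msg);", "safe")]
best_prompt, best_acc = evolve_prompt(
    "Is this code vulnerable?", toy_generator, toy_executor, dev,
    rounds=2, variants_per_round=3,
)
print(best_prompt, best_acc)
```

Because the candidate prompts are judged by a model other than the one that wrote them, the generator cannot reward its own phrasing, which is the self-correction bias the abstract says this decoupling mitigates.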

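Experimental Results

Research Questions

  • RQ1: How does MulVul compare with existing LLM-based vulnerability detection methods at the coarse category level and the fine-grained type level?
  • RQ2: How does the Router's top-k parameter affect the precision-recall trade-off and overall Macro-F1?
  • RQ3: How much do retrieval grounding, the multi-agent architecture, and prompt evolution each contribute to performance?
  • RQ4: How does MulVul perform on few-shot CWE types and across data scales overall?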

Key Findings

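Method | Macro-Precision | Macro-Recall | Macro-F1
GPT-4o | 3.86
LLM × CPG | 27.44 | 62.81 | 38.20
LLMVulExp | 41.50
VISION | 26.80
MulVul (Ours) | 50.41 | 58.45
(Values in %. Rows with fewer than three values are missing cells in the source, so their column positions cannot be recovered.)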
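  • MulVul reaches 50.41% Macro-F1 at the category level.
  • MulVul reaches 34.79% Macro-F1 at the type level, 10.21 points above the best baseline (the 41.5% relative improvement reported in the abstract).
  • Macro-Recall rises as k increases while Macro-Precision falls; Macro-F1 peaks at k=3.
  • The ablation shows that retrieval grounding is essential: removing retrieval drops Macro-F1 from 34.56% to 21.80%.
  • Cross-model prompt evolution brings a large gain: replacing the evolved prompts with manual ones lowers F1 by 11.76 points, consistent with the 51.6% relative gain over manual prompts reported in the abstract.
  • MulVul holds up in few-shot settings: F1 is about 48% for CWE types with fewer than 100 samples and about 63% at roughly 300 samples.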
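The findings mix absolute point differences with relative percentages, so a short worked example may help. The sketch below assumes the common definition of Macro-F1 as the unweighted mean of per-class F1 (the source does not spell out its exact formula) and reads the 10.21 and 11.76 figures as absolute percentage points. The labels in the toy example are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# A classifier that always predicts the frequent class: 75% accuracy,
# but the rare class contributes an F1 of 0 and drags the macro average down.
y_true = ["CWE-79", "CWE-79", "CWE-79", "CWE-120"]
y_pred = ["CWE-79", "CWE-79", "CWE-79", "CWE-79"]
toy_macro = macro_f1(y_true, y_pred)

# Type level: 34.79 is 10.21 points above the best baseline.
type_gain = relative_gain(34.79, 34.79 - 10.21)
# Ablation: 34.56 with evolved prompts, 11.76 points lower with manual prompts.
prompt_gain = relative_gain(34.56, 34.56 - 11.76)
print(round(toy_macro, 4), round(type_gain, 3), round(prompt_gain, 3))
```

Under these readings the two relative gains come out at about 0.415 and 0.516, matching the 41.5% and 51.6% quoted in the abstract, which suggests the point differences and the relative percentages describe the same results.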
Figure 2: Overview of MulVul for vulnerability detection. The router agent first selects top-k candidate vulnerability categories, and category-specific detector agents then perform fine-grained identification with retrieved CWE-specific evidence.

This review was generated by AI and checked by human editors.