QUICK REVIEW

[Paper Review] MulVul: Retrieval-augmented Multi-Agent Code Vulnerability Detection via Cross-Model Prompt Evolution

Zihan Wu, Jie Xu | arXiv (Cornell University) | 2026-01-26

Adversarial Robustness in Machine Learning | Citations: 0

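One-Line Summary

MulVul is a two-stage Router-Detector multi-agent system that detects many code vulnerability types through retrieval-grounded reasoning and cross-model prompt evolution, achieving state-of-the-art Macro-F1 on PrimeVul.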

ABSTRACT

Large Language Models (LLMs) struggle to automate real-world vulnerability detection due to two key limitations: the heterogeneity of vulnerability patterns undermines the effectiveness of a single unified model, and manual prompt engineering for massive weakness categories is unscalable. To address these challenges, we propose MulVul, a retrieval-augmented multi-agent framework designed for precise and broad-coverage vulnerability detection. MulVul adopts a coarse-to-fine strategy: a Router agent first predicts the top-k coarse categories and then forwards the input to specialized Detector agents, which identify the exact vulnerability types. Both agents are equipped with retrieval tools to actively source evidence from vulnerability knowledge bases to mitigate hallucinations. Crucially, to automate the generation of specialized prompts, we design Cross-Model Prompt Evolution, a prompt optimization mechanism where a generator LLM iteratively refines candidate prompts while a distinct executor LLM validates their effectiveness. This decoupling mitigates the self-correction bias inherent in single-model optimization. Evaluated on 130 CWE types, MulVul achieves 34.79% Macro-F1, outperforming the best baseline by 41.5%. Ablation studies validate cross-model prompt evolution, which boosts performance by 51.6% over manual prompts by effectively handling diverse vulnerability patterns.

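Motivation and Objectives

Address the heterogeneity and scalability problems of automated vulnerability detection across hundreds of CWE types.
Propose a coarse-to-fine Router-Detector architecture that routes each input to specialized detectors.
Automate prompt optimization through cross-model evolution, avoiding self-correction bias and improving robustness.
Ground reasoning in SCALE-based retrieval to mitigate hallucinations.
Demonstrate state-of-the-art performance on PrimeVul across 130 CWE types, including the few-shot regime.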

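Proposed Method

Adopt a coarse-to-fine Router-Detector framework: the Router predicts the top-k coarse categories and selects the Detectors responsible for the fine-grained vulnerability types in those categories.
Use SCALE-based structured representations to capture code semantics and guide retrieval.
In offline preparation, build a SCALE-based knowledge base K and optimize the Router and Detector prompts with Cross-Model Prompt Evolution.
Cross-Model Prompt Evolution separates the generator (Claude) from the executor (GPT-4o), so prompts evolve iteratively while an independent LLM evaluates them.
Detectors perform contrastive retrieval over in-category examples, clean examples, and out-of-category hard negatives to improve precision.
During online detection, the Router retrieves cross-category evidence, each selected Detector uses category-specific retrieved evidence to identify the exact vulnerability type, and the results are aggregated.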
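
The generator/executor split described above can be sketched as a simple hill-climbing loop. This is a minimal illustration, not the authors' implementation: the `Generator` and `Executor` interfaces, the accuracy-based fitness score, the acceptance rule, and the keyword-matching toy stand-ins are all assumptions introduced here so the example runs without any LLM API.

```python
from itertools import cycle
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Generator: given the current prompt and its score, proposes a revised prompt.
#   Executor:  given a prompt and a code sample, returns a predicted label.
Generator = Callable[[str, float], str]
Executor = Callable[[str, str], str]
DevSet = List[Tuple[str, str]]  # (code, gold label) pairs


def score_prompt(prompt: str, executor: Executor, dev_set: DevSet) -> float:
    """Fraction of dev samples the executor labels correctly with this prompt."""
    hits = sum(executor(prompt, code) == label for code, label in dev_set)
    return hits / len(dev_set)


def evolve_prompt(seed: str, generator: Generator, executor: Executor,
                  dev_set: DevSet, rounds: int = 3,
                  candidates: int = 3) -> Tuple[str, float]:
    """Keep a candidate only if the separate executor scores it strictly higher."""
    best, best_score = seed, score_prompt(seed, executor, dev_set)
    for _ in range(rounds):
        for _ in range(candidates):
            candidate = generator(best, best_score)
            candidate_score = score_prompt(candidate, executor, dev_set)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


def make_toy_generator() -> Generator:
    """Toy stand-in for the generator LLM: appends hints in a fixed order."""
    hints = cycle(["check buffer bounds", "check freed pointers",
                   "check integer overflow"])

    def generate(prompt: str, score: float) -> str:
        return f"{prompt} Also {next(hints)}."

    return generate


def toy_executor(prompt: str, code: str) -> str:
    """Toy stand-in for the executor LLM: keyword matching only."""
    if "strcpy" in code and "buffer bounds" in prompt:
        return "CWE-787"
    if "free(" in code and "freed pointers" in prompt:
        return "CWE-416"
    return "safe"


dev_set = [("strcpy(dst, src);", "CWE-787"),
           ("free(p); use(p);", "CWE-416"),
           ("return a + b;", "safe")]
seed = "Classify the vulnerability in this code."
best_prompt, best_score = evolve_prompt(seed, make_toy_generator(),
                                        toy_executor, dev_set)
print(best_prompt)
print(best_score)
```

The point of the sketch is the separation of roles: the model that writes a candidate prompt never grades it, which is the decoupling the review credits with reducing self-correction bias. In a real setting the two callables would wrap different LLMs, and the fitness function would likely be a Macro-F1 style metric rather than plain accuracy.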

Figure 1: Comparison between MulVul and existing LLM-based vulnerability detection methods. (a) Existing methods rely on fixed prompts and lack external grounding. (b) MulVul adopts a coarse-to-fine, retrieval-augmented multi-agent framework for multi-type vulnerability detection.

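Experimental Results

Research Questions

RQ1: How does MulVul compare with existing LLM-based vulnerability detection methods at the coarse category level and the fine-grained type level?
RQ2: How does the Router's top-k parameter affect the precision-recall trade-off and overall Macro-F1?
RQ3: How much do retrieval grounding, the multi-agent architecture, and prompt evolution each contribute to performance?
RQ4: How does MulVul perform in few-shot CWE scenarios and across data conditions overall?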

Key Results

Method	Macro-Precision	Macro-Recall	Macro-F1
GPT-4o	3.86	—	—
LLM × CPG	27.44	62.81	38.20
LLMVulExp	41.50	—	—
VISION	26.80	—	—
MulVul (Ours)	50.41	58.45	—

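At the category level, MulVul achieves 50.41% Macro-F1, ahead of the best baseline.
At the type level, it achieves 34.79% Macro-F1, 10.21 points above the best baseline, which corresponds to the 41.5% relative improvement reported in the abstract.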
Macro-Recall improves with larger k while Macro-Precision declines; Macro-F1 peaks at k=3.
In the ablation study, retrieval grounding proves decisive: removing retrieval drops Macro-F1 from 34.56% to 21.80%.
Cross-model prompt evolution yields substantial gains: manual prompts score 11.76 points lower in F1 than evolved prompts.
MulVul shows strong few-shot performance, reaching about 48% F1 on CWEs with fewer than 100 samples and about 63% F1 at around 300 samples.
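
The results above mix absolute gaps (percentage points) with relative improvements (percent), so a short worked calculation may help. The sketch below shows how Macro-F1 averages per-class F1 scores and how the reported gaps relate to the relative gains in the abstract. The per-class counts are made-up toy values, and the baseline scores are derived here by subtraction rather than quoted from the paper; reading the 11.76 gap as absolute points against the 34.56% ablation score is likewise an assumption.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-class F1, so rare CWE types count as much as common ones."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


# Toy per-class counts (tp, fp, fn); illustrative values only.
toy_counts = {"CWE-787": (8, 2, 2), "CWE-416": (1, 1, 3)}
toy_macro = macro_f1(toy_counts)

# Type level: 34.79% Macro-F1, 10.21 points above the best baseline.
type_baseline = 34.79 - 10.21            # derived, not quoted from the paper
type_gain = relative_gain(34.79, type_baseline)

# Ablation: 34.56% with evolved prompts, 11.76 points lower with manual prompts
# (assumption: the gap is in absolute points).
manual_score = 34.56 - 11.76             # derived, not quoted from the paper
evolution_gain = relative_gain(34.56, manual_score)

print(round(toy_macro, 4))
print(round(type_gain, 1))
print(round(evolution_gain, 1))
```

Under these assumptions the two relative gains round to 41.5% and 51.6%, matching the figures quoted in the abstract, which suggests the gaps in the results list are absolute points rather than percentages.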

Figure 2: Overview of MulVul for vulnerability detection. The router agent first selects top-k candidate vulnerability categories, and category-specific detector agents then perform fine-grained identification with retrieved CWE-specific evidence.
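
The flow in Figure 2 can be approximated with a few lines of glue code. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function signatures, the three-category knowledge base, and the keyword-based router and detectors are hypothetical stand-ins for the LLM agents and the SCALE-based retrieval the review describes.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical interfaces (assumptions made for this sketch):
RetrieveFn = Callable[[str, Optional[str]], List[str]]   # (code, category) -> evidence
RouterFn = Callable[[str, List[str]], Dict[str, float]]  # (code, evidence) -> category scores
DetectorFn = Callable[[str, List[str]], List[str]]       # (code, evidence) -> CWE ids


def detect(code: str, router: RouterFn, detectors: Dict[str, DetectorFn],
           retrieve: RetrieveFn, k: int = 3) -> List[str]:
    """Route to the top-k coarse categories, run their detectors, merge findings."""
    # The router sees cross-category evidence (category=None).
    scores = router(code, retrieve(code, None))
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    findings = set()
    for category in top_k:
        if category in detectors:
            # Each detector sees only evidence from its own category.
            findings.update(detectors[category](code, retrieve(code, category)))
    return sorted(findings)


# Toy knowledge base and agents; keyword matching replaces the LLM calls.
KB = {
    "memory": ["strcpy without a length check can overflow (CWE-787)"],
    "injection": ["string-built SQL queries can be injected (CWE-89)"],
    "crypto": ["hard-coded keys weaken encryption (CWE-321)"],
}


def toy_retrieve(code: str, category: Optional[str]) -> List[str]:
    if category is None:
        return [entry for entries in KB.values() for entry in entries]
    return KB.get(category, [])


def toy_router(code: str, evidence: List[str]) -> Dict[str, float]:
    return {
        "memory": 0.9 if "strcpy" in code else 0.1,
        "injection": 0.8 if "SELECT" in code else 0.2,
        "crypto": 0.05,
    }


toy_detectors: Dict[str, DetectorFn] = {
    "memory": lambda code, ev: ["CWE-787"] if "strcpy" in code else [],
    "injection": lambda code, ev: ["CWE-89"] if "SELECT" in code and "+" in code else [],
    "crypto": lambda code, ev: [],
}

sample = 'strcpy(buf, input); query = "SELECT * FROM t WHERE id=" + input;'
narrow = detect(sample, toy_router, toy_detectors, toy_retrieve, k=1)
wide = detect(sample, toy_router, toy_detectors, toy_retrieve, k=2)
print(narrow)
print(wide)
```

With k=1 only the memory detector runs and the injection issue is missed; with k=2 both are reported. This mirrors the trade-off noted in the results: a larger k raises recall because more detectors run, at the cost of more chances for false positives.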

This review was generated by AI and reviewed by a human editor.