QUICK REVIEW

[Paper Review] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Yunpeng Xiong, Ting Zhang|arXiv (Cornell University)|Jan 30, 2026

Web Application Security Vulnerabilities | Cited by 0

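One-Sentence Summary

This paper compares three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for filtering false positives (FPs) out of SAST tool output, and shows that they can cut FPs substantially, although the gains depend on the backbone model and the CWE category.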

ABSTRACT

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.

Research Motivation and Objectives

Motivate the FP filtering problem in SAST and quantify the burden of false positives.
Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
Assess performance across benchmark (OWASP Benchmark) and real-world Vul4J Java vulnerabilities.
Analyze how backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5) and CWE categories affect results.
Provide practical guidance on deploying LLM-based agents for FP filtering.

Proposed Method

Standardize prompt design and constrain external tooling for fair comparison across agent frameworks.
Use aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
Compare agent performance against vanilla zero-shot prompting acting as a baseline.
Measure performance with metrics including false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
Analyze agent trajectories to identify success and failure patterns and derive practical guidelines.
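
The metrics named above can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions, where a "positive" is an alert the filter keeps as a vulnerability; this review does not reproduce the paper's exact formulas, so the definitions below are an assumption. The `Alert` class and `filtering_metrics` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One SAST alert after filtering (hypothetical structure)."""
    is_real_vuln: bool  # ground-truth label
    kept: bool          # True if the filter kept the alert as a vulnerability


def filtering_metrics(alerts):
    """Compute FPR, precision, and recall using standard definitions (assumed)."""
    tp = sum(a.is_real_vuln and a.kept for a in alerts)
    fp = sum((not a.is_real_vuln) and a.kept for a in alerts)
    fn = sum(a.is_real_vuln and (not a.kept) for a in alerts)
    tn = sum((not a.is_real_vuln) and (not a.kept) for a in alerts)

    def ratio(num, den):
        # Avoid division by zero on empty classes.
        return num / den if den else 0.0

    return {
        "fpr": ratio(fp, fp + tn),        # benign alerts that survive filtering
        "precision": ratio(tp, tp + fp),  # surviving alerts that are real
        "recall": ratio(tp, tp + fn),     # real vulnerabilities that survive
    }


# Toy data, invented for illustration only (not from the paper).
toy_alerts = [
    Alert(is_real_vuln=True, kept=True),
    Alert(is_real_vuln=True, kept=False),   # a suppressed true vulnerability
    Alert(is_real_vuln=False, kept=True),   # a remaining false positive
    Alert(is_real_vuln=False, kept=False),
    Alert(is_real_vuln=False, kept=False),
]
print(filtering_metrics(toy_alerts))
```

Under these definitions, a filter that dismisses alerts more aggressively lowers FPR but can also lower recall, which is the trade-off the study highlights.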

Experimental Results

Research Questions

RQ1: How effective are different LLM-based agent frameworks in filtering FPs generated by SAST tools?
RQ2: How effective are the LLM-based agent frameworks in identifying FPs in real-world scenarios?
RQ3: What are the key success drivers and recurring failure modes of LLM-based agents in FP identification?

Key Findings

Model	Agent	FPR	Change vs. SAST (notes)
Claude Sonnet 4	Aider	14.3%	(↓84.1%)
Claude Sonnet 4	OpenHands	14.9%	(↓83.5%)
Claude Sonnet 4	SWE-agent	6.3%	(↓92.1%)
Claude Sonnet 4	Vanilla LLM	23.0%	Baseline
DeepSeek Chat	Aider	13.2%	(↓85.1%)
DeepSeek Chat	OpenHands	15.8%	(↓82.6%)
DeepSeek Chat	SWE-agent	13.1%	(↓85.2%)
DeepSeek Chat	Vanilla LLM	11.2%	Baseline
GPT-5	Aider	20.3%	(↓78.0%)
GPT-5	OpenHands	16.3%	(↓82.0%)
GPT-5	SWE-agent	14.1%	(↓84.2%)
GPT-5	Vanilla LLM	20.4%	Baseline

LLM-based agents can substantially reduce SAST noise: the best configuration (SWE-agent with Claude Sonnet 4) lowers the false positive rate on the OWASP Benchmark from an initial rate above 92% to 6.3%.
On Vul4J real-world CodeQL findings, agents reach up to 93.3% FP identification rate in some configurations.
Benefits of agentic filtering are highly backbone- and CWE-dependent; stronger models see more gains, while weaker backbones show limited or inconsistent improvements.
Aggressive FP reduction can suppress true vulnerabilities, revealing important trade-offs between FP removal and vulnerability preservation.
There is large variability in computational cost across agent frameworks, defining a practical cost-accuracy frontier for deployment.
With Claude Sonnet 4 and GPT-5, every agent matches or beats vanilla prompting, while with DeepSeek Chat, zero-shot prompting performs best (11.2% FPR versus 13.1% to 15.8% for the agents).
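
The figures in the table above can be cross-checked with a short sketch. The numbers are copied from the table; the function names are hypothetical. One interpretation is an assumption rather than something this review states: the ↓ values are treated as percentage-point reductions from the raw SAST false positive rate, so adding each ↓ value to its FPR should recover that starting rate.

```python
# (FPR %, reported reduction %) copied from the table above.
RESULTS = {
    "Claude Sonnet 4": {"Aider": (14.3, 84.1), "OpenHands": (14.9, 83.5), "SWE-agent": (6.3, 92.1)},
    "DeepSeek Chat": {"Aider": (13.2, 85.1), "OpenHands": (15.8, 82.6), "SWE-agent": (13.1, 85.2)},
    "GPT-5": {"Aider": (20.3, 78.0), "OpenHands": (16.3, 82.0), "SWE-agent": (14.1, 84.2)},
}
VANILLA_FPR = {"Claude Sonnet 4": 23.0, "DeepSeek Chat": 11.2, "GPT-5": 20.4}


def best_setup(model):
    """Return the setup (agent or vanilla prompting) with the lowest FPR."""
    candidates = {agent: fpr for agent, (fpr, _) in RESULTS[model].items()}
    candidates["Vanilla LLM"] = VANILLA_FPR[model]
    return min(candidates, key=candidates.get)


def implied_sast_fpr(model, agent):
    """FPR plus reduction, ASSUMING the reduction is in percentage points."""
    fpr, drop = RESULTS[model][agent]
    return round(fpr + drop, 1)


for model in RESULTS:
    print(model, "->", best_setup(model))
    for agent in RESULTS[model]:
        print("   ", agent, "implied SAST FPR:", implied_sast_fpr(model, agent))
```

Under that assumption every row implies a starting SAST FPR of roughly 98.3% to 98.4%, which is consistent with the "over 92%" figure quoted in the abstract. The lowest-FPR setup is SWE-agent for Claude Sonnet 4 and GPT-5, and vanilla prompting for DeepSeek Chat. Lowest FPR alone does not make a setup the best choice, since the study also reports that aggressive filtering can suppress true vulnerabilities.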


This review was generated by AI and checked by a human editor.