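[Paper Digest] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
This paper compares three agent frameworks built on large language models (LLMs), namely Aider, OpenHands, and SWE-agent, for filtering false positives from SAST tools. It shows that the agents substantially reduce FPs, although performance depends on the backbone model and the CWE category.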
Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.
Research Motivation and Objectives
- Motivate the FP filtering problem in SAST and quantify the burden of false positives.
- Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
- Assess performance on both the OWASP Benchmark and real-world Java vulnerabilities from Vul4J.
- Analyze how backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5) and CWE categories affect results.
- Provide practical guidance on deploying LLM-based agents for FP filtering.
Proposed Method
- Standardize prompt design and constrain external tooling for fair comparison across agent frameworks.
- Use aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
- Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
- Compare agent performance against vanilla zero-shot prompting as the baseline.
- Measure performance with metrics including false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
- Analyze agent trajectories to identify success and failure patterns and derive practical guidelines.
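The classification metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the `Verdict` record and the `filtering_metrics` function are hypothetical names, and the sketch assumes the positive class is "the alert is a real vulnerability", so that FPR is the share of non-vulnerable alerts the agent still reports. The paper's exact definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One SAST alert after triage (hypothetical record, for illustration only)."""
    is_real_vulnerability: bool  # ground-truth label of the alert
    kept_by_agent: bool          # True if the agent still reports the alert


def filtering_metrics(verdicts):
    """Compute FPR, precision, and recall under the assumed definitions."""
    tp = sum(v.is_real_vulnerability and v.kept_by_agent for v in verdicts)
    fp = sum((not v.is_real_vulnerability) and v.kept_by_agent for v in verdicts)
    fn = sum(v.is_real_vulnerability and (not v.kept_by_agent) for v in verdicts)
    tn = sum((not v.is_real_vulnerability) and (not v.kept_by_agent) for v in verdicts)

    def ratio(numerator, denominator):
        # Avoid division by zero when a class is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "fpr": ratio(fp, fp + tn),        # false alerts that survive filtering
        "precision": ratio(tp, tp + fp),  # surviving alerts that are real
        "recall": ratio(tp, tp + fn),     # real vulnerabilities that survive
    }


# Toy data: 3 real vulnerabilities (1 wrongly dismissed), 4 false alerts (1 kept).
sample = (
    [Verdict(True, True)] * 2
    + [Verdict(True, False)]
    + [Verdict(False, True)]
    + [Verdict(False, False)] * 3
)
metrics = filtering_metrics(sample)
print(metrics)
```

Reporting recall next to FPR is what exposes the trade-off the study highlights: a filter that dismisses every alert reaches an FPR of zero while also suppressing every true vulnerability.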
Experimental Results
Research Questions
- RQ1: How effective are different LLM-based agent frameworks in filtering FPs generated by SAST tools?
- RQ2: How effective are the LLM-based agent frameworks in identifying FPs in real-world scenarios?
- RQ3: What are the key success drivers and recurring failure modes of LLM-based agents in FP identification?
Key Findings
| Model | Agent | FPR after filtering | Change vs. raw SAST output |
|---|---|---|---|
| Claude Sonnet 4 | Aider | 14.3% | (↓84.1%) |
| Claude Sonnet 4 | OpenHands | 14.9% | (↓83.5%) |
| Claude Sonnet 4 | SWE-agent | 6.3% | (↓92.1%) |
| Claude Sonnet 4 | Vanilla LLM | 23.0% | Baseline |
| DeepSeek Chat | Aider | 13.2% | (↓85.1%) |
| DeepSeek Chat | OpenHands | 15.8% | (↓82.6%) |
| DeepSeek Chat | SWE-agent | 13.1% | (↓85.2%) |
| DeepSeek Chat | Vanilla LLM | 11.2% | Baseline |
| GPT-5 | Aider | 20.3% | (↓78.0%) |
| GPT-5 | OpenHands | 16.3% | (↓82.0%) |
| GPT-5 | SWE-agent | 14.1% | (↓84.2%) |
| GPT-5 | Vanilla LLM | 20.4% | Baseline |
- LLM-based agents can substantially reduce SAST noise: the best configuration brings the residual FP rate on the OWASP Benchmark down to 6.3%, from an initial FP rate of over 92%.
- On real-world CodeQL alerts from Vul4J, the best agent configuration reaches an FP identification rate of up to 93.3%.
- Benefits of agentic filtering are highly backbone- and CWE-dependent; stronger models see more gains, while weaker backbones show limited or inconsistent improvements.
- Aggressive FP reduction can suppress true vulnerabilities, revealing important trade-offs between FP removal and vulnerability preservation.
- There is large variability in computational cost across agent frameworks, defining a practical cost-accuracy frontier for deployment.
- Across backbones, Claude- and GPT-based agents can match or surpass vanilla prompting, while for DeepSeek, vanilla zero-shot prompting can perform best.
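The backbone dependence noted above can be checked directly against the FPR values in the table. The sketch below copies those values and computes, for each backbone, how many percentage points each agent improves on the vanilla baseline. The names `FPR_PERCENT` and `agent_gain_over_vanilla` are illustrative, not from the paper.

```python
# FPR (%) after filtering on the OWASP Benchmark, copied from the table above.
FPR_PERCENT = {
    "Claude Sonnet 4": {"Aider": 14.3, "OpenHands": 14.9, "SWE-agent": 6.3, "Vanilla LLM": 23.0},
    "DeepSeek Chat": {"Aider": 13.2, "OpenHands": 15.8, "SWE-agent": 13.1, "Vanilla LLM": 11.2},
    "GPT-5": {"Aider": 20.3, "OpenHands": 16.3, "SWE-agent": 14.1, "Vanilla LLM": 20.4},
}


def agent_gain_over_vanilla(fpr_by_agent):
    """Percentage-point FPR improvement of each agent over vanilla prompting.

    Positive means the agent leaves fewer false positives than the baseline;
    negative means vanilla prompting did better.
    """
    baseline = fpr_by_agent["Vanilla LLM"]
    return {
        agent: round(baseline - fpr, 1)
        for agent, fpr in fpr_by_agent.items()
        if agent != "Vanilla LLM"
    }


for model, row in FPR_PERCENT.items():
    print(model, agent_gain_over_vanilla(row))
```

With these numbers, every agent improves on the baseline for Claude Sonnet 4 and GPT-5 (Aider on GPT-5 only by 0.1 points), while every agent is worse than the baseline for DeepSeek Chat. That matches the finding that zero-shot prompting can perform best for that backbone.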
This digest was generated by AI and reviewed by a human editor.