QUICK REVIEW

[Paper Review] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Yunpeng Xiong, Ting Zhang|arXiv (Cornell University)|Jan 30, 2026
Web Application Security Vulnerabilities | Cited by 0
One-Sentence Summary

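This paper compares three LLM-based agent frameworks (Aider, OpenHands, and SWE-agent) for filtering false positives (FPs) out of SAST tool output. It shows that the agents can substantially reduce FPs, although the gains depend on the backbone model and the CWE category.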

ABSTRACT

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.

Research Motivation and Objectives

  • Motivate the FP filtering problem in SAST and quantify the burden of false positives.
  • Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
  • Assess performance on both the OWASP Benchmark and real-world Java vulnerabilities from Vul4J.
  • Analyze how backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5) and CWE categories affect results.
  • Provide practical guidance on deploying LLM-based agents for FP filtering.

Proposed Method

  • Standardize prompt design and constrain external tooling for fair comparison across agent frameworks.
  • Use aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
  • Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
  • Compare agent performance against vanilla zero-shot prompting as a baseline.
  • Measure performance with false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
  • Analyze agent trajectories to identify success and failure patterns and derive practical guidelines.
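The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard confusion-matrix definitions, which may differ in detail from how the paper computes them. The `Confusion` class, the `apply_filter` helper, and all counts are hypothetical and chosen only to show how removing FPs lowers FPR while suppressing true vulnerabilities lowers recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Alert outcomes on a labelled benchmark (all counts are hypothetical)."""
    tp: int  # truly vulnerable cases that are still flagged
    fp: int  # non-vulnerable cases that are still flagged
    tn: int  # non-vulnerable cases that are not flagged
    fn: int  # truly vulnerable cases that are not flagged


def _ratio(num: int, den: int) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return num / den if den else 0.0


def fpr(c: Confusion) -> float:
    return _ratio(c.fp, c.fp + c.tn)


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def apply_filter(before: Confusion, fps_removed: int, tps_suppressed: int) -> Confusion:
    """Model an FP filter: dismissed FPs become TNs, dismissed TPs become FNs."""
    if not 0 <= fps_removed <= before.fp:
        raise ValueError("fps_removed must be between 0 and the number of FPs")
    if not 0 <= tps_suppressed <= before.tp:
        raise ValueError("tps_suppressed must be between 0 and the number of TPs")
    return Confusion(
        tp=before.tp - tps_suppressed,
        fp=before.fp - fps_removed,
        tn=before.tn + fps_removed,
        fn=before.fn + tps_suppressed,
    )


# Hypothetical raw SAST output, then a filter that drops 70 FPs but also 9 true alerts.
before = Confusion(tp=90, fp=80, tn=20, fn=10)
after = apply_filter(before, fps_removed=70, tps_suppressed=9)

if __name__ == "__main__":
    for name, c in (("before", before), ("after", after)):
        print(f"{name}: FPR={fpr(c):.3f} precision={precision(c):.3f} recall={recall(c):.3f}")
```

Reporting recall next to FPR is the point of the sketch: a filter that looks excellent on FPR alone can still be discarding real vulnerabilities.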

Experimental Results

Research Questions

  • RQ1: How effective are different LLM-based agent frameworks in filtering FPs generated by SAST tools?
  • RQ2: How effective are the LLM-based agent frameworks in identifying FPs in real-world scenarios?
  • RQ3: What are the key success drivers and recurring failure modes of LLM-based agents in FP identification?

Key Findings

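| Model           | Agent       | FPR (compared to SAST) | Notes    |
|-----------------|-------------|------------------------|----------|
| Claude Sonnet 4 | Aider       | 14.3% (↓84.1%)         |          |
| Claude Sonnet 4 | OpenHands   | 14.9% (↓83.5%)         |          |
| Claude Sonnet 4 | SWE-agent   | 6.3% (↓92.1%)          |          |
| Claude Sonnet 4 | Vanilla LLM | 23.0%                  | Baseline |
| DeepSeek Chat   | Aider       | 13.2% (↓85.1%)         |          |
| DeepSeek Chat   | OpenHands   | 15.8% (↓82.6%)         |          |
| DeepSeek Chat   | SWE-agent   | 13.1% (↓85.2%)         |          |
| DeepSeek Chat   | Vanilla LLM | 11.2%                  | Baseline |
| GPT-5           | Aider       | 20.3% (↓78.0%)         |          |
| GPT-5           | OpenHands   | 16.3% (↓82.0%)         |          |
| GPT-5           | SWE-agent   | 14.1% (↓84.2%)         |          |
| GPT-5           | Vanilla LLM | 20.4%                  | Baseline |
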
  • LLM-based agents can substantially reduce SAST noise: the best configuration (SWE-agent with Claude Sonnet 4) leaves a residual FPR of only 6.3% on the OWASP Benchmark, down from an initial FP rate of over 92%.
  • On Vul4J real-world CodeQL findings, agents reach up to 93.3% FP identification rate in some configurations.
  • Benefits of agentic filtering are highly backbone- and CWE-dependent; stronger models see more gains, while weaker backbones show limited or inconsistent improvements.
  • Aggressive FP reduction can suppress true vulnerabilities, revealing important trade-offs between FP removal and vulnerability preservation.
  • There is large variability in computational cost across agent frameworks, defining a practical cost-accuracy frontier for deployment.
  • Across backbones, agents built on Claude Sonnet 4 and GPT-5 match or surpass vanilla prompting, whereas for DeepSeek Chat vanilla zero-shot prompting performs best (11.2% FPR).
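The comparisons in these findings can be checked directly against the results table. The sketch below copies the table values and assumes that each "↓" figure is a percentage-point reduction relative to the raw SAST output; that reading is an inference from the numbers, not something the summary states. The variable names are illustrative.

```python
# (model, agent, FPR %, reported drop vs. SAST %), copied from the results table.
AGENT_ROWS = [
    ("Claude Sonnet 4", "Aider", 14.3, 84.1),
    ("Claude Sonnet 4", "OpenHands", 14.9, 83.5),
    ("Claude Sonnet 4", "SWE-agent", 6.3, 92.1),
    ("DeepSeek Chat", "Aider", 13.2, 85.1),
    ("DeepSeek Chat", "OpenHands", 15.8, 82.6),
    ("DeepSeek Chat", "SWE-agent", 13.1, 85.2),
    ("GPT-5", "Aider", 20.3, 78.0),
    ("GPT-5", "OpenHands", 16.3, 82.0),
    ("GPT-5", "SWE-agent", 14.1, 84.2),
]

# Vanilla zero-shot prompting FPR (%) per backbone, from the "Baseline" rows.
VANILLA_FPR = {"Claude Sonnet 4": 23.0, "DeepSeek Chat": 11.2, "GPT-5": 20.4}

# Assumption: drop is in percentage points, so FPR + drop recovers the raw SAST FPR.
implied_baselines = [round(rate + drop, 1) for _, _, rate, drop in AGENT_ROWS]

# Lowest-FPR agent for each backbone.
best_agent = {}
for model, agent, rate, _ in AGENT_ROWS:
    if model not in best_agent or rate < best_agent[model][1]:
        best_agent[model] = (agent, rate)

# Does the best agent beat vanilla prompting for that backbone?
beats_vanilla = {m: rate < VANILLA_FPR[m] for m, (_, rate) in best_agent.items()}

if __name__ == "__main__":
    print("implied SAST FPR range:", min(implied_baselines), "to", max(implied_baselines))
    for model, (agent, rate) in best_agent.items():
        print(f"{model}: best agent {agent} at {rate}% vs vanilla {VANILLA_FPR[model]}%")
```

Under that assumption every row implies a raw SAST FPR of roughly 98.3% to 98.4%, which is consistent with the "over 92%" figure quoted in the abstract. The per-backbone comparison also shows that SWE-agent is the lowest-FPR agent for all three backbones, yet for DeepSeek Chat it still does not beat vanilla prompting.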

This review was generated by AI and reviewed by human editors.