QUICK REVIEW

[Paper Review] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Yunpeng Xiong, Ting Zhang | arXiv (Cornell University) | 2026-01-30

Web Application Security Vulnerabilities | Citations: 0

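One-Line Summary

The paper compares three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for filtering false positives (FPs) produced by SAST tools and shows that they can substantially reduce FPs, although the gains depend on the backbone model and the CWE category.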

ABSTRACT

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.

Research Motivation and Objectives

Motivate the FP filtering problem in SAST and quantify the burden of false positives.
Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
Assess performance on a benchmark (OWASP Benchmark) and on real-world Java vulnerabilities from Vul4J.
Analyze how backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5) and CWE categories affect results.
Provide practical guidance on deploying LLM-based agents for FP filtering.

Proposed Method

Standardize prompt design and constrain external tooling for fair comparison across agent frameworks.
Use aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
Compare agent performance against vanilla zero-shot prompting as a baseline.
Measure performance with metrics including false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
Analyze trajectories to identify success and failure patterns and derive practical guidelines.
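
The metrics named above can be illustrated with a minimal sketch. It assumes the conventional confusion-matrix definitions applied to the alerts that survive filtering; the paper's exact definitions may differ, and the `Alert` class, its field names, and `filter_metrics` are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical record for one SAST alert (not from the paper).
    is_real_vuln: bool  # ground-truth label
    kept: bool          # True if the filter kept the alert, False if dismissed as an FP


def filter_metrics(alerts):
    """Compute FPR, precision, and recall over a list of filtered alerts.

    Assumed definitions (standard, but not confirmed by the source):
      FPR       = FP / (FP + TN)
      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
    """
    tp = sum(a.is_real_vuln and a.kept for a in alerts)
    fp = sum((not a.is_real_vuln) and a.kept for a in alerts)
    fn = sum(a.is_real_vuln and not a.kept for a in alerts)
    tn = sum((not a.is_real_vuln) and not a.kept for a in alerts)

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "fpr": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


if __name__ == "__main__":
    # Toy data only: 2 real vulnerabilities and 8 benign alerts.
    toy = (
        [Alert(True, True), Alert(True, False)]
        + [Alert(False, True)] * 2
        + [Alert(False, False)] * 6
    )
    print(filter_metrics(toy))
```

In this toy case the filter dismisses six of the eight benign alerts but also drops one of the two real vulnerabilities, so FPR falls to 0.25 while recall is only 0.5. That is the kind of trade-off between FP removal and vulnerability preservation the review describes.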

Experimental Results

Research Questions

RQ1: How effective are different LLM-based agent frameworks in filtering FPs generated by SAST tools?
RQ2: How effective are the LLM-based agent frameworks in identifying FPs in real-world scenarios?
RQ3: What are the key success drivers and recurring failure modes of LLM-based agents in FP identification?

Key Results

LLM-based agents can substantially reduce SAST noise, with the best configuration achieving an FP remaining rate as low as 6.3% on OWASP Benchmark (from an initial >92% FP rate).
On Vul4J real-world CodeQL findings, agents reach up to 93.3% FP identification rate in some configurations.
Benefits of agentic filtering are highly backbone- and CWE-dependent; stronger models see more gains, while weaker backbones show limited or inconsistent improvements.
Aggressive FP reduction can suppress true vulnerabilities, revealing important trade-offs between FP removal and vulnerability preservation.
There is large variability in computational cost across agent frameworks, defining a practical cost-accuracy frontier for deployment.
Across backbones, Claude and GPT-based agents can match or surpass vanilla prompting, while for DeepSeek zero-shot prompting can perform best.
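
One way to read the cost-accuracy frontier mentioned above is as the set of configurations that no other configuration beats on both cost and accuracy at once. The sketch below shows that idea under stated assumptions: `pareto_frontier` is a hypothetical helper, and the configuration names and numbers are made-up placeholders, not results from the paper.

```python
def pareto_frontier(configs):
    """Return names of configurations that are not dominated.

    configs: list of (name, cost, accuracy) tuples.
    A configuration is dominated if another one has cost <= and
    accuracy >= its own, with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, acc in configs:
        dominated = any(
            (c <= cost and a >= acc) and (c < cost or a > acc)
            for _, c, a in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


if __name__ == "__main__":
    # Placeholder values for illustration only (cost in arbitrary units).
    example = [
        ("config_a", 10.0, 0.70),
        ("config_b", 50.0, 0.90),
        ("config_c", 60.0, 0.85),  # costs more than config_b and is less accurate
        ("config_d", 10.0, 0.60),  # same cost as config_a but less accurate
    ]
    print(pareto_frontier(example))  # prints ['config_a', 'config_b']
```

Under this reading, a deployment choice only needs to weigh the configurations on the frontier, since any other option is both no cheaper and no more accurate than one of them.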

This review was generated by AI and reviewed by a human editor.