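[Paper Review] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering
The paper compares three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for filtering false positives (FPs) produced by SAST tools, and shows that they can sharply reduce FPs, with performance that depends on both the backbone model and the CWE category.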
Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On the real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% on CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.
Research Motivation and Objectives
- Motivate the FP filtering problem in SAST and quantify the burden of false positives.
- Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
- Assess performance on both the OWASP Benchmark and real-world Java vulnerabilities from Vul4J.
- Analyze how backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5) and CWE categories affect results.
- Provide practical guidance on deploying LLM-based agents for FP filtering.
Proposed Method
- Standardize prompt design and constrain external tooling for fair comparison across agent frameworks.
- Use aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
- Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
- Compare agent performance against vanilla zero-shot prompting as a baseline.
- Measure performance with metrics including false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
- Analyze agent trajectories to identify success and failure patterns and derive practical guidelines.
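The accuracy metrics listed above can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions, where an alert the agent keeps counts as a positive prediction and an alert it filters out counts as a negative one; the review does not spell out the paper's exact formulas, so treat these definitions as an assumption. The names `Verdict`, `confusion_counts`, and `filtering_metrics` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Verdict:
    """One SAST alert after triage (hypothetical record layout)."""
    is_true_vulnerability: bool  # ground-truth label of the alert
    kept_as_vulnerable: bool     # True if the agent kept it, False if filtered as FP


def confusion_counts(verdicts: Iterable[Verdict]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) under the assumed definitions."""
    tp = fp = tn = fn = 0
    for v in verdicts:
        if v.kept_as_vulnerable and v.is_true_vulnerability:
            tp += 1  # real vulnerability correctly kept
        elif v.kept_as_vulnerable:
            fp += 1  # false alarm that survived filtering
        elif v.is_true_vulnerability:
            fn += 1  # real vulnerability wrongly suppressed
        else:
            tn += 1  # false alarm correctly filtered out
    return tp, fp, tn, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def filtering_metrics(verdicts: Iterable[Verdict]) -> Dict[str, float]:
    """Compute FPR, precision, and recall for a batch of triaged alerts."""
    tp, fp, tn, fn = confusion_counts(verdicts)
    return {
        "fpr": _ratio(fp, fp + tn),        # share of false alarms still reported
        "precision": _ratio(tp, tp + fp),  # share of kept alerts that are real
        "recall": _ratio(tp, tp + fn),     # share of real vulnerabilities preserved
    }
```

Reporting recall next to FPR matters here because a filter that rejects every alert reaches an FPR of zero while also discarding every true vulnerability.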
Experimental Results
Research Questions
- RQ1: How effective are different LLM-based agent frameworks in filtering FPs generated by SAST tools?
- RQ2: How effective are the LLM-based agent frameworks in identifying FPs in real-world scenarios?
- RQ3: What are the key success drivers and recurring failure modes of LLM-based agents in FP identification?
Key Results
- LLM-based agents can substantially reduce SAST noise: the best configuration lowers the FP rate on the OWASP Benchmark from an initial value above 92% to as low as 6.3%.
- On real-world CodeQL alerts from Vul4J, the best configuration reaches an FP identification rate of up to 93.3%.
- Benefits of agentic filtering are highly backbone- and CWE-dependent; stronger models see more gains, while weaker backbones show limited or inconsistent improvements.
- Aggressive FP reduction can suppress true vulnerabilities, revealing important trade-offs between FP removal and vulnerability preservation.
- There is large variability in computational cost across agent frameworks, defining a practical cost-accuracy frontier for deployment.
- Across backbones, Claude- and GPT-based agents can match or surpass vanilla prompting, whereas for DeepSeek, vanilla zero-shot prompting can perform best.
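One way to read the cost-accuracy frontier mentioned above is as a Pareto frontier over configurations, where a configuration is worth considering only if no other one is at least as cheap and at least as accurate. The sketch below is an illustration under that assumption, not the paper's analysis. The `Config` type, the `pareto_frontier` function, and every number in the usage example are hypothetical placeholders rather than reported results.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """A hypothetical (framework, backbone) configuration."""
    name: str
    tokens: float        # computational cost, lower is better
    fp_remaining: float  # fraction of FPs left after filtering, lower is better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations not dominated on (tokens, fp_remaining)."""
    frontier = []
    for c in configs:
        dominated = any(
            o.tokens <= c.tokens
            and o.fp_remaining <= c.fp_remaining
            and (o.tokens < c.tokens or o.fp_remaining < c.fp_remaining)
            for o in configs
        )
        if not dominated:
            frontier.append(c)
    # Sort by cost so the frontier reads from cheapest to most expensive.
    return sorted(frontier, key=lambda c: c.tokens)


# Placeholder values for illustration only; these are NOT results from the paper.
candidates = [
    Config("config-a", tokens=100.0, fp_remaining=0.30),
    Config("config-b", tokens=200.0, fp_remaining=0.10),
    Config("config-c", tokens=250.0, fp_remaining=0.20),  # dominated by config-b
    Config("config-d", tokens=400.0, fp_remaining=0.05),
]
print([c.name for c in pareto_frontier(candidates)])
```

This sketch covers only cost and remaining FPs. Because aggressive filtering can suppress true vulnerabilities, a practical selection would also need to constrain or track recall.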
This review was generated by AI and reviewed by a human editor.