QUICK REVIEW

[Paper Review] Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Yunpeng Xiong, Ting Zhang|arXiv (Cornell University)|Jan 30, 2026

Web Application Security Vulnerabilities | Citations: 0

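One-line summary

The paper compares three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) on filtering false positives from SAST tools. It demonstrates substantial FP reduction and shows that performance depends on the backbone model and the CWE category.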

ABSTRACT

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, i.e., Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using the vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On real-world dataset, the best configuration of LLM-based agents can achieve an FP identification rate of up to 93.3% involving CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.

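Motivation and objectives

Motivate the FP filtering problem in SAST and quantify the burden of false positives.
Evaluate three LLM-based agent frameworks (Aider, OpenHands, SWE-agent) for FP triage.
Evaluate performance on a benchmark (OWASP Benchmark) and on real-world Java vulnerabilities from Vul4J.
Analyze how the backbone model (Claude Sonnet 4, DeepSeek Chat, GPT-5) and the CWE category affect the results.
Provide practical guidance for deploying LLM-based agents for FP filtering.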

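Method

Standardize prompt design and constrain the use of external tools so that the agent frameworks can be compared fairly.
Use the aggregated FP alerts from four SAST tools (CodeQL, Semgrep, SonarQube, Joern) as the candidate pool.
Evaluate each agent with three backbone models (Claude Sonnet 4, DeepSeek Chat, GPT-5).
Compare agent performance against vanilla zero-shot prompting, which serves as the baseline.
Measure performance with metrics such as false positive rate (FPR), precision, recall, and computational cost (rounds, tokens).
Analyze agent trajectories to identify success and failure patterns and derive practical guidelines.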
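
The accuracy metrics named above can be sketched in a few lines. This is a minimal illustration that assumes the standard confusion-matrix definitions (FPR = FP / (FP + TN), and so on). The review does not spell out the paper's exact formulas, and the `Alert` class and `filtering_metrics` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One SAST alert after filtering (hypothetical structure, for illustration only)."""
    is_true_vulnerability: bool  # ground-truth label
    kept_by_filter: bool         # True if the filter still reports the alert as a vulnerability


def filtering_metrics(alerts):
    """Compute FPR, precision, and recall using the standard definitions (an assumption)."""
    tp = sum(a.is_true_vulnerability and a.kept_by_filter for a in alerts)
    fp = sum((not a.is_true_vulnerability) and a.kept_by_filter for a in alerts)
    fn = sum(a.is_true_vulnerability and (not a.kept_by_filter) for a in alerts)
    tn = sum((not a.is_true_vulnerability) and (not a.kept_by_filter) for a in alerts)

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent from the data.
        return numerator / denominator if denominator else 0.0

    return {
        "fpr": ratio(fp, fp + tn),        # share of non-vulnerabilities still reported
        "precision": ratio(tp, tp + fp),  # share of reported alerts that are real
        "recall": ratio(tp, tp + fn),     # share of real vulnerabilities still reported
    }
```

Reporting recall next to FPR matters here because a filter that dismisses every alert reaches an FPR of zero while also discarding every true vulnerability.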

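Experimental results

Research questions

RQ1: How effective are different LLM-based agent frameworks at filtering FPs generated by SAST tools?
RQ2: How effective are LLM-based agent frameworks at identifying FPs in real-world scenarios?
RQ3: What are the main success factors and recurring failure modes of LLM-based agents in FP identification?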

Key findings

Model	Agent	FPR	Change vs. SAST
Claude Sonnet 4	Aider	14.3%	(↓84.1%)
Claude Sonnet 4	OpenHands	14.9%	(↓83.5%)
Claude Sonnet 4	SWE-agent	6.3%	(↓92.1%)
Claude Sonnet 4	Vanilla LLM	23.0%	Baseline
DeepSeek Chat	Aider	13.2%	(↓85.1%)
DeepSeek Chat	OpenHands	15.8%	(↓82.6%)
DeepSeek Chat	SWE-agent	13.1%	(↓85.2%)
DeepSeek Chat	Vanilla LLM	11.2%	Baseline
GPT-5	Aider	20.3%	(↓78.0%)
GPT-5	OpenHands	16.3%	(↓82.0%)
GPT-5	SWE-agent	14.1%	(↓84.2%)
GPT-5	Vanilla LLM	20.4%	Baseline
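
To make the backbone dependence in the table easier to read, the following sketch finds, for each backbone, the agent with the lowest FPR and its gap to the vanilla baseline. The FPR values are copied from the table. The dictionary layout and the `best_agent_vs_vanilla` helper are assumptions made for this illustration and are not part of the paper.

```python
# FPR (%) per backbone and agent, copied from the table above.
FPR = {
    "Claude Sonnet 4": {"Aider": 14.3, "OpenHands": 14.9, "SWE-agent": 6.3, "Vanilla LLM": 23.0},
    "DeepSeek Chat": {"Aider": 13.2, "OpenHands": 15.8, "SWE-agent": 13.1, "Vanilla LLM": 11.2},
    "GPT-5": {"Aider": 20.3, "OpenHands": 16.3, "SWE-agent": 14.1, "Vanilla LLM": 20.4},
}


def best_agent_vs_vanilla(fpr_by_agent):
    """Return the lowest-FPR agent and its FPR difference to vanilla prompting, in points."""
    vanilla = fpr_by_agent["Vanilla LLM"]
    agents = {name: fpr for name, fpr in fpr_by_agent.items() if name != "Vanilla LLM"}
    best = min(agents, key=agents.get)
    # A negative difference means the agent ends with a lower FPR than the baseline.
    return best, round(agents[best] - vanilla, 1)


for model, row in FPR.items():
    best, delta = best_agent_vs_vanilla(row)
    print(f"{model}: best agent {best}, {delta:+.1f} points vs. vanilla")
```

With these values, SWE-agent has the lowest FPR among the agents for all three backbones, and DeepSeek Chat is the only backbone where the vanilla baseline stays lower (11.2% versus 13.1%).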

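LLM-based agents can substantially reduce SAST noise: the best configuration lowers the FP rate on the OWASP Benchmark from an initial rate above 92% to 6.3%.
On real-world CodeQL findings from Vul4J, agents reach an FP identification rate of up to 93.3%, depending on the configuration.
The benefit of agent-based filtering depends strongly on the backbone and the CWE category: stronger models gain more, while weaker backbones show limited or inconsistent improvement.
Aggressive FP reduction can suppress true vulnerabilities, so there is an important trade-off between removing FPs and retaining real vulnerabilities.
Computational cost varies widely across agent frameworks, which shapes the practical cost-accuracy frontier for deployment.
Across backbones, Claude- and GPT-based agents can match or exceed vanilla prompting, whereas for DeepSeek zero-shot prompting sometimes performs best.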

This review was generated by AI and checked by a human editor.