QUICK REVIEW

[Paper Review] I Can't Believe It's Not a Valid Exploit

Derin Gezgin, Amartya Das|arXiv (Cornell University)|Feb 4, 2026

Web Application Security Vulnerabilities|Cited by 0

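One-Sentence Summary

The paper introduces PoC-Gym, a framework that uses static analysis traces to guide large language models in generating Java PoC exploits, and shows that success rates that look high under automated validation collapse under post-hoc checking: manual inspection finds 71.5% of the PoCs invalid.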

ABSTRACT

Recently, Large Language Models (LLMs) have been used in security vulnerability detection tasks including generating proof-of-concept (PoC) exploits. A PoC exploit is a program used to demonstrate how a vulnerability can be exploited. Several approaches suggest that supporting LLMs with additional guidance can improve PoC generation outcomes, motivating further evaluation of their effectiveness. In this work, we develop PoC-Gym, a framework for PoC generation for Java security vulnerabilities via LLMs and systematic validation of generated exploits. Using PoC-Gym, we evaluate whether the guidance from static analysis tools improves the PoC generation success rate and manually inspect the resulting PoCs. Our results from running PoC-Gym with Claude Sonnet 4, GPT-5 Medium, and gpt-oss-20b show that using static analysis for guidance and criteria leads to 21% higher success rates than the prior baseline, FaultLine. However, manual inspection of both successful and failed PoCs reveals that 71.5% of the PoCs are invalid. These results show that the reported success of LLM-based PoC generation can be significantly misleading, which is hard to detect with current validation mechanisms.

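Research Motivation and Goals

Evaluate whether static analysis guidance improves the PoC generation success rate for Java vulnerabilities when using LLMs.
Systematically compare generated PoCs against ground-truth vulnerability locations to detect false positives.
Understand the limitations of current LLM-based PoC generation pipelines and the effectiveness of execution-level validation.
Compare PoC-Gym with prior systems such as FaultLine on real-world CVE scenarios.
Identify failure modes and how errors propagate in LLM-based PoC generation.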

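Proposed Method

Develop PoC-Gym, a pipeline comprising prompt construction, LLM-based PoC generation, and execution-based validation with feedback.
Combine vulnerability context (CVE/CWE), static data-flow traces from CodeQL/IRIS, and repository metadata in the prompt to guide PoC generation.
Validate PoCs through compilation, execution, and sink checks using an AspectJ plugin, to determine whether the [VULN] signal is emitted and whether the sink is reachable.
Evaluate on 20 real-world Java CVE scenarios (CWE-Bench-Java) with several LLMs (Claude Sonnet 4, GPT-5 Medium, gpt-oss-20b), comparing settings with and without traces.
Perform a post-hoc analysis that compares dynamic traces against ground-truth sinks, together with manual inspection of the PoCs.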
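
The difference between the automated check and the post-hoc check can be pictured with a minimal Python sketch. Everything in it is hypothetical: the `RunResult` record, its field names, and the method names are illustrative assumptions, not PoC-Gym's actual data model. It only shows why an output-level [VULN] signal and a ground-truth sink hit are separate questions.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class RunResult:
    """Hypothetical record of one PoC execution (not PoC-Gym's real schema)."""
    compiled: bool
    stdout: str
    # Methods observed at runtime, e.g. as recorded by instrumentation.
    reached_methods: Set[str] = field(default_factory=set)


def automated_success(run: RunResult) -> bool:
    """Output-level check: the PoC compiled and printed the [VULN] marker."""
    return run.compiled and "[VULN]" in run.stdout


def post_hoc_valid(run: RunResult, ground_truth_sinks: Set[str]) -> bool:
    """Stricter check: the run must also reach at least one ground-truth sink."""
    return automated_success(run) and bool(run.reached_methods & ground_truth_sinks)


# Illustrative data: both PoCs print the marker, but only one touches the sink.
sinks = {"com.example.Archive.extract"}
superficial = RunResult(True, "[VULN] file written", {"com.example.Helper.copy"})
genuine = RunResult(True, "[VULN] file written", {"com.example.Archive.extract"})

print(automated_success(superficial), post_hoc_valid(superficial, sinks))
print(automated_success(genuine), post_hoc_valid(genuine, sinks))
```

In this toy setup the first PoC passes the automated check but fails the post-hoc one, which is the kind of false positive the post-hoc analysis is meant to surface.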

Figure 1: Overview of the PoC-Gym pipeline which consists of three main stages: prompt construction, PoC generation, and PoC validation with feedback.
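
The three stages named in the caption can be sketched as a small generate-validate-retry loop. This is a sketch under stated assumptions, not the authors' implementation: `build_prompt`, `run_pipeline`, the `max_rounds` limit, the feedback wording, and the placeholder CVE identifier are all hypothetical, and the LLM and validator are replaced by stand-in functions.

```python
from typing import Callable, Optional, Tuple


def build_prompt(cve_id: str, cwe_id: str, trace: Optional[str]) -> str:
    """Stage 1: combine vulnerability context with an optional static trace."""
    parts = [f"Target: {cve_id} ({cwe_id})"]
    if trace is not None:
        parts.append(f"Static source-to-sink trace:\n{trace}")
    return "\n".join(parts)


def run_pipeline(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,  # assumed retry budget, not taken from the paper
) -> Optional[str]:
    """Stages 2 and 3: generate a PoC, validate it, and feed errors back."""
    for _ in range(max_rounds):
        poc = generate(prompt)
        ok, feedback = validate(poc)
        if ok:
            return poc
        # Append the validator's message so the next attempt can react to it.
        prompt = f"{prompt}\n\nPrevious attempt failed:\n{feedback}"
    return None


def fake_generate(prompt: str) -> str:
    # Stand-in for an LLM call: only "fixes" the PoC after seeing feedback.
    return "good-poc" if "Previous attempt failed" in prompt else "bad-poc"


def fake_validate(poc: str) -> Tuple[bool, str]:
    # Stand-in for compile/execute/sink checks.
    return (poc == "good-poc", "compilation error")


prompt = build_prompt("CVE-XXXX-YYYY", "CWE-22", "source() -> sink()")
result = run_pipeline(prompt, fake_generate, fake_validate)
print(result)
```

Passing `trace=None` to `build_prompt` corresponds to the no-trace setting, so the same loop can serve both configurations being compared.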

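Experimental Results

Research Questions

RQ1: Does static source-sink trace guidance improve the PoC generation success rate when using LLMs?
RQ2: How valid are PoCs that pass automated validation when compared against ground-truth vulnerability locations?
RQ3: What fraction of "successful" PoCs are actually false positives with respect to exercising the target vulnerability path?
RQ4: Does execution-level validation reduce false positives more than output-only checks?
RQ5: What are the common failure modes in PoC generation pipelines for Java vulnerabilities?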

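Key Findings

Static analysis guidance yields a PoC generation success rate 21% higher than the prior baseline, FaultLine.
Post-hoc analysis shows that 71.5% of the PoCs are invalid on manual inspection, counting both successful and failed runs.
Without trace information, the apparent success rate under automated validation can reach 85%, but ground-truth validation lowers this figure substantially.
Trace guidance narrows the gap between initial success and ground-truth validity but does not eliminate all invalid PoCs.
With Claude Sonnet 4, PoC-Gym reaches up to 8/14 post-hoc successes on CWE-Bench-Java projects, ahead of FaultLine's 5/14 on the same set.
A substantial share of failures stems from propagated LLM errors, mislabeled CVEs, and reliance on surface-level signals rather than genuine vulnerability reachability.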
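
As a quick arithmetic check, the reported 8/14 and 5/14 post-hoc counts can be converted into rates. The gap comes to roughly 21 percentage points, which is consistent with the 21% improvement quoted above. This summary does not say that the 21% figure is derived from these counts, so treat that link as an assumption.

```python
from fractions import Fraction

# Post-hoc success counts as reported in the findings above.
poc_gym = Fraction(8, 14)
faultline = Fraction(5, 14)
gap = poc_gym - faultline

print(f"PoC-Gym:   {float(poc_gym):.1%}")    # 57.1%
print(f"FaultLine: {float(faultline):.1%}")  # 35.7%
print(f"Gap:       {float(gap) * 100:.1f} percentage points")  # 21.4
```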

Figure 2: Distribution of the manual analysis results for the multi-trace runs. The plain run results are given in Appendix D.4.

This review was generated by AI and reviewed by a human editor.