QUICK REVIEW

[Paper Digest] SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

Cen Zhang, Younggi Park|arXiv (Cornell University)|Feb 7, 2026

Adversarial Robustness in Machine Learning | Cited by 0

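One-Sentence Summary

This survey analyzes the final competition of DARPA's AIxCC (the AFC; AIxCC itself ran 2023–2025), detailing its design decisions, the CRS architectures, the results, and the implications for future research on autonomous vulnerability discovery and patching.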

ABSTRACT

DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI, particularly large language models (LLMs), to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

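Motivation and Objectives

Assess how AIxCC was designed to guide and evaluate autonomous vulnerability discovery and patching in open-source software.
Describe the architectures and technical approaches of the finalist Cyber Reasoning Systems (CRSs).
Analyze the competition results beyond the final scoreboard to identify the real performance drivers and limitations.
Derive actionable lessons for organizing future competitions and for deploying autonomous CRSs in practice.
Provide guidance on turning competition results into research value and on considerations for industry deployment.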

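Methodology

Systematically analyze the AFC design documents, the codebases of the seven finalist CRSs, and the organizers' competition database (challenges, results, traces).
Cross-validate the technical approaches through discussions with the organizers and the finalist teams.
Annotate each CPV (challenge vulnerability) and compare against baseline vulnerability discovery and patching techniques in a controlled setting.
Synthesize lessons and future directions for competition design and CRS deployment.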

Figure 1 : AFC workflow. GitHub webhooks trigger challenge dispatch and CRSs submit results via the Competition API. Each CRS operates in an isolated network with access to the Competition API, build dependencies, and LLM endpoints.

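Experimental Results

Research Questions

RQ1: How was AIxCC designed to guide and evaluate AI-driven vulnerability discovery and patching?
RQ2: What architectures and technical approaches did the finalist teams adopt?
RQ3: What insights do the competition results reveal?
RQ4: What are the lessons learned and future directions for organizing competitions and deploying autonomous CRSs?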

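Key Findings

AIxCC combines workflows embedded in real-world OSS development (full scans, delta scans, SARIF review, and report bundling) with time-decayed scoring to balance discovery against patch quality.
Across the seven finalist CRSs, stability and accuracy were the main determinants of performance; AT achieved the highest total score because it stayed active throughout every phase.
Teams used two complementary POV pipelines (enhanced fuzzing and LLM-based POV generation), and generated patches with designs ranging from multi-architecture ensembles and multi-agent systems to single-agent systems.
SARIF validation strategies varied (POV-centric, LLM-judge-centric, and vulnerability-candidate-centric), which affected how much reporting and validation contributed to the score.
Bundling strategies link POVs, patches, and SARIF assessments, enabling consistent vulnerability reports but carrying the risk of penalties for incorrect pairings.
The final results show that Java CPVs are meaningful as a basis for comparison, that TI scored strongly on POVs, that AT excelled at patching and bundling, and that AC's stability and accuracy played a decisive role in the competition outcome.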
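
To make the idea of time-decayed scoring concrete, here is a minimal sketch. It is not the AIxCC scoring formula: the linear decay shape, the `floor` fraction, and the names `time_decayed_score`, `base_points`, `elapsed`, and `window` are all assumptions chosen purely for illustration. It only shows the general principle that an earlier correct submission earns more than a later one.

```python
def time_decayed_score(base_points: float, elapsed: float,
                       window: float, floor: float = 0.5) -> float:
    """Hypothetical linear time decay (NOT the official AIxCC formula).

    A submission at elapsed=0 earns full base_points; one made exactly at
    the end of the window earns floor * base_points; anything outside the
    window earns nothing.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if elapsed < 0 or elapsed > window:
        return 0.0
    remaining = 1.0 - elapsed / window  # fraction of the window still left
    return base_points * (floor + (1.0 - floor) * remaining)


if __name__ == "__main__":
    # The same finding is worth less the later it is submitted.
    for t in (0, 50, 100):
        print(t, time_decayed_score(2.0, t, window=100))
```

Under a scheme of this kind, a CRS that stays stable and keeps submitting throughout the window collects more credit than one that produces the same findings late, which is consistent with the emphasis on stability in the findings above.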

Figure 2 : Score per time (top) and phase (bottom) axes.

This digest was generated by AI and reviewed by a human editor.