QUICK REVIEW

[Paper Review] SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

Cen Zhang, Younggi Park|arXiv (Cornell University)|Feb 7, 2026

Adversarial Robustness in Machine Learning | Citations: 0

One-sentence summary

This SoK analyzes DARPA's AI Cyber Challenge (AIxCC, 2023–2025), focusing on its final competition (AFC): the design decisions, the CRS architectures, the results, and the lessons for future research on autonomous vulnerability discovery and patching.

ABSTRACT

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI -- particularly large language models (LLMs) -- to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

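Motivation and objectives

Assess how AIxCC is designed to guide and evaluate automated vulnerability discovery and patching in open-source software.
Characterize the architectural and technical approaches of the finalist CRSs (cyber reasoning systems).
Analyze the competition results beyond the final scoreboard to identify the real performance drivers and constraints.
Derive actionable lessons for organizing future competitions and for deploying autonomous CRSs in practice.
Provide guidance for translating competition results into research value and industry adoption.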

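Method

Systematically analyze the AFC design documents, the codebases of the seven finalist CRSs, and the organizers' competition database (challenges, results, and traces).
Cross-validate the technical approaches through discussions with the organizers and the finalists.
Compare each CPV (challenge vulnerability) annotation against the underlying vulnerability discovery and patching techniques in a controlled environment.
Synthesize lessons and future directions for competition design and CRS deployment.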

Figure 1 : AFC workflow. GitHub webhooks trigger challenge dispatch and CRSs submit results via the Competition API. Each CRS operates in an isolated network with access to the Competition API, build dependencies, and LLM endpoints.

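Results

Research questions

RQ1: How is AIxCC designed to guide and evaluate AI-driven vulnerability discovery and patching?
RQ2: What architectural and technical approaches did the finalist teams adopt?
RQ3: What insights can be drawn from the competition results?
RQ4: What are the lessons for organizing competitions, and what are the future directions for deploying autonomous CRSs?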

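Key findings

AIxCC combines workflows modeled on real-world OSS integration (full scan, delta scan, SARIF review, and report consolidation) with time-decayed scoring to balance vulnerability discovery against patch quality.
Across the seven finalist CRSs, stability and accuracy were the main determinants of performance; AT achieved the highest total score by staying active throughout every phase.
Teams used two complementary PoV (proof-of-vulnerability) pipelines, enhanced fuzzing and LLM-based PoV generation, and varied their patch-generation architectures, including multi-architecture ensembles and multi-agent versus single-agent designs.
SARIF validation strategies varied (PoV-centric, LLM-judge-centric, and bug-candidate-centric), which affected how reports and validation contributed to the score.
Bundling strategies that link PoV, patch, and SARIF assessments produced coherent vulnerability reports but also carried the risk of penalties for incorrect pairings.
In the final results, the Java CPVs enabled meaningful comparison: TI achieved strong PoV scores, AT excelled at patching and bundling, and AC stability and accuracy strongly shaped the competitive outcome.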
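
To make the idea of time-decayed scoring concrete, here is a minimal Python sketch. It is not the AFC scoring formula, which this review does not state. The linear decay shape, the `floor` multiplier, the point values, and the names `Submission` and `time_decayed_score` are all assumptions chosen for illustration. The sketch only shows the general mechanism: a correct submission is worth more the earlier it arrives within the challenge window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """A hypothetical scored submission (not the AFC data model)."""
    kind: str            # e.g. "pov", "patch", or "sarif"
    base_points: float   # assumed value of a correct submission at t = 0
    submitted_at: float  # seconds elapsed since the challenge was dispatched


def time_decayed_score(sub: Submission, window: float, floor: float = 0.5) -> float:
    """Scale base_points linearly from 100% at t = 0 down to `floor` at t = window.

    Both the linear shape and the default floor are illustrative assumptions.
    Times outside [0, window] are clamped to the nearest end of the window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    elapsed_fraction = min(max(sub.submitted_at / window, 0.0), 1.0)
    multiplier = 1.0 - (1.0 - floor) * elapsed_fraction
    return sub.base_points * multiplier


if __name__ == "__main__":
    window = 3600.0  # hypothetical one-hour challenge window
    for t in (0.0, 1800.0, 3600.0):
        sub = Submission(kind="pov", base_points=2.0, submitted_at=t)
        print(t, time_decayed_score(sub, window))
```

Under these assumptions, a system that finds the same vulnerability later earns strictly less, which is one way a scoring rule can reward the sustained, early activity that the findings above associate with the top total score.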

Figure 2 : Score per time (top) and phase (bottom) axes.


This review was generated by AI and checked by a human editor.