QUICK REVIEW

[Paper Review] Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All

Chenxi Huang, Alex Mathai | arXiv (Cornell University) | Feb 2, 2026

Security and Verification in Computing | Citations: 0

One-Line Summary

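The paper introduces Live-kBench, a self-evolving benchmark, and kEnv, an agent-agnostic execution environment, for evaluating LLM-based kernel crash-resolution agents. On an inaugural dataset of 534 bugs, it shows a performance gap between bugs fixed before and after the LLM knowledge cutoff, and gains from crash-resolution feedback (CRF).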

ABSTRACT

Repairing system crashes discovered by kernel fuzzers like Syzkaller is a critical yet underexplored challenge in software engineering. While recent works have introduced Large Language Model (LLM) based agents for Linux kernel crash-resolution, their evaluation benchmarks are usually static and thus, do not capture the evolving nature of the Linux kernel, and suffer from potential data contamination due to LLM knowledge cutoffs. To address the above problem, we present (i) Live-kBench, an evaluation framework for self-evolving benchmarks that continuously scrapes and evaluates agents on freshly discovered kernel bugs, and (ii) kEnv, an agent-agnostic standardized crash-resolution environment for kernel compilation, execution, and feedback. This design decouples agent workflows from heavy-weight execution, enabling fair and scalable comparison across diverse agent frameworks under identical conditions. To this end, we curate an inaugural dataset of 534 Linux kernel bugs and empirically demonstrate a significant performance gap, with agents achieving up to 25% higher equivalent patch rate on bugs fixed before the LLM knowledge cutoff. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt (plausible patches); however only ~20% of generated patches closely match developer fixes. Additionally, exposing crash resolution feedback improves crash resolution rate by 29%. Live-kBench provides the community with an evaluation infrastructure for self-evolving benchmarks that is both time and attribute sensitive; complete with a public dashboard to track agent progress on Linux kernel bugs.

Motivation and Objectives

Motivate the need for live, time-aware evaluation of kernel crash-resolution by LLM-based agents.
Develop an agent-agnostic environment (kEnv) to standardize kernel crash workflows across agents.
Create Live-kBench, a self-evolving benchmark that continuously curates fresh kernel bugs for evaluation.
Provide metrics and dashboards to analyze crash resolution, localization, and patch equivalence across time and bug attributes.

Proposed Method

Introduce kEnv as a standardized, agent-agnostic crash-resolution environment that compiles and tests patched kernels via a common interface.
Integrate a crash-resolution feedback (CRF) tool to enable iterative patch refinement by agents.
Build Live-kBench by continuously scraping fresh kernel bugs from Syzbot, reproducing them, and evaluating agent patches via a public dashboard.
Use the Live-kBench-2512 dataset of 534 bugs to benchmark multiple agents across LLMs and scaffolds under time-aware conditions.
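The iterative refinement that CRF enables can be sketched as a simple loop: the agent proposes a patch, the environment builds and tests it, and the result is fed back into the next proposal. The sketch below is illustrative only. `Feedback`, `refine_with_feedback`, `propose`, and `evaluate` are hypothetical names, not the kEnv API, and the stub functions stand in for a real agent and a real kernel build. It also assumes that a patch counts as resolving the crash when the kernel compiles and the reproducer no longer crashes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical result of building and testing one candidate patch."""
    compiled: bool
    crashed: bool
    log: str

    @property
    def resolved(self) -> bool:
        # Assumption: resolved means the kernel builds and the reproducer no longer crashes.
        return self.compiled and not self.crashed


def refine_with_feedback(
    propose: Callable[[Optional[Feedback]], str],
    evaluate: Callable[[str], Feedback],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Request patches until one resolves the crash or the attempt budget runs out.

    Returns (patch, attempts_used); patch is None if no attempt succeeded.
    """
    feedback: Optional[Feedback] = None
    for attempt in range(1, max_attempts + 1):
        patch = propose(feedback)   # the agent sees the previous feedback, if any
        feedback = evaluate(patch)  # the environment builds and reruns the reproducer
        if feedback.resolved:
            return patch, attempt
    return None, max_attempts


# Stubs standing in for a real agent and a real environment (illustration only).
def fake_propose(feedback: Optional[Feedback]) -> str:
    return "patch v1" if feedback is None else "patch v2 with null check"


def fake_evaluate(patch: str) -> Feedback:
    if "null check" in patch:
        return Feedback(compiled=True, crashed=False, log="reproducer ran cleanly")
    return Feedback(compiled=True, crashed=True, log="crash still reproduces")


patch, attempts = refine_with_feedback(fake_propose, fake_evaluate)
```

With these stubs the first patch still crashes and the second one succeeds, so the loop stops after two attempts. Keeping the build and test step behind a single `evaluate` callable mirrors the decoupling of agent workflow from heavy-weight execution described above.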

Figure 1: Interaction between kEnv and Live-kBench. A kEnv instance is brought up and merged with an agent-specific overlay (1). A patch generation request from Live-kBench invokes the agent (2), and it runs within the kEnv instance (3). Finally, the patch is submitted to Live-kBench (4).

Experimental Results

Research Questions

RQ1: What is the performance difference of LLM-based agents on kernel crash-resolution tasks before and after the LLM knowledge cutoff?
RQ2: How does the choice of agent scaffold affect kernel crash-resolution performance?
RQ3: What is the impact of different LLM backends on crash-resolution performance and localization?
RQ4: What is the upper-bound crash-resolution performance with perfect localization?
RQ5: Is crash-resolution feedback (CRF) beneficial for autonomous agents?
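The time-aware comparison behind RQ1 amounts to splitting bugs by whether their fix landed before or after the model's knowledge cutoff and comparing a metric across the two groups. The following is a minimal sketch under stated assumptions: `BugResult` and `equivalent_rate_by_cutoff` are hypothetical names, the sample records and cutoff date are made up for illustration, and treating a fix on the cutoff date itself as "before" is a choice made here, not something the paper specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass
class BugResult:
    """Hypothetical per-bug record: when the developer fix landed and how the agent did."""
    bug_id: str
    fix_date: date
    equivalent: bool  # agent patch judged equivalent to the developer fix


def equivalent_rate_by_cutoff(
    results: Iterable[BugResult], cutoff: date
) -> Dict[str, Optional[float]]:
    """Return the equivalent-patch rate for bugs fixed before and after the cutoff."""
    results = list(results)
    before = [r for r in results if r.fix_date <= cutoff]
    after = [r for r in results if r.fix_date > cutoff]

    def rate(group: List[BugResult]) -> Optional[float]:
        # None signals an empty group rather than a misleading 0.0.
        return sum(r.equivalent for r in group) / len(group) if group else None

    return {"before": rate(before), "after": rate(after)}


# Made-up records, for illustration only.
sample = [
    BugResult("bug-a", date(2024, 3, 1), True),
    BugResult("bug-b", date(2024, 5, 20), False),
    BugResult("bug-c", date(2025, 2, 10), False),
    BugResult("bug-d", date(2025, 6, 30), False),
]
rates = equivalent_rate_by_cutoff(sample, cutoff=date(2024, 12, 31))
```

Because each model has its own cutoff, the same bug can fall in the "before" group for one model and the "after" group for another, so the split has to be computed per model.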

Key Findings

Agents resolve about 74% of crashes on the first attempt (plausible patches) but only ~20% of patches are equivalent to developer fixes.
Agents show up to 25% higher equivalent patch rate on bugs fixed before the cutoff compared to after the cutoff.
Crash-resolution feedback (CRF) improves crash-resolution rate by about 29% but does not change localization IoU or equivalence.
Providing perfect localization (oracle mode) raises equivalence by ~12% while slightly reducing crash-resolution rate.
Test-time scaling improves crash-resolution rate to around 90% and patch equivalence to ~30% under an unlimited budget.
Live-kBench demonstrates time-aware, attribute-based evaluation and a community dashboard for analysis across bug attributes and agent scaffolds.
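The localization IoU mentioned in the findings is an intersection-over-union score between the locations an agent edits and the locations the developer fix edits. This summary does not state the granularity the paper uses, so the sketch below assumes sets of file paths. The function name, the example paths, and the convention of returning 0.0 when both sets are empty are assumptions made for illustration.

```python
from typing import Iterable


def localization_iou(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Intersection over union of two sets of edited locations (file paths assumed here)."""
    predicted_set, reference_set = set(predicted), set(reference)
    union = predicted_set | reference_set
    if not union:
        # Nothing to compare; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(predicted_set & reference_set) / len(union)


# Hypothetical paths: the agent touches two files, the developer fix touches one of them.
agent_files = ["drivers/example/a.c", "drivers/example/b.c"]
developer_files = ["drivers/example/a.c"]
iou = localization_iou(agent_files, developer_files)
```

Here the two sets share one file out of two distinct files overall, giving an IoU of 0.5. A patch can score high on IoU and still fail to be equivalent to the developer fix, which is consistent with CRF raising the crash-resolution rate without changing IoU or equivalence.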

Figure 2: Live Benchmarking. Live-kBench first curates kernel bugs from Syzbot (1), selects the bugs that are reliably triggered (2), executes agents on the bugs using kEnv (3), and computes and stores metrics that are finally displayed on a dashboard (4).
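Step (2) of the pipeline can be pictured as a reproducibility check: a bug stays in the benchmark only if its reproducer triggers the crash consistently. The sketch below is one plausible way to express such a filter, not the paper's procedure. The function name, the number of trials, and the requirement that every trial crashes are assumptions, and `run_reproducer` is a hypothetical callable standing in for booting a kernel and running the reproducer.

```python
from typing import Callable


def reliably_triggered(
    run_reproducer: Callable[[], bool],
    trials: int = 5,
    min_crashes: int = 5,
) -> bool:
    """Run the reproducer several times and keep the bug only if it crashes often enough.

    run_reproducer returns True when the crash was observed in that run.
    """
    crashes = sum(1 for _ in range(trials) if run_reproducer())
    return crashes >= min_crashes


# Stubs for illustration: one reproducer always crashes, the other misses once.
always_crashes = reliably_triggered(lambda: True)

flaky_outcomes = iter([True, False, True, True, True])
flaky = reliably_triggered(lambda: next(flaky_outcomes))
```

With these defaults the deterministic reproducer passes and the flaky one is excluded. Dropping flaky bugs matters for evaluation because a patch that appears to stop the crash might otherwise just have been tested on a run where the crash did not fire.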

This review was generated by AI and checked by a human editor.