
[Paper Digest] XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights

Arun Joshi | arXiv (Cornell University) | Mar 6, 2026
Explainable Artificial Intelligence (XAI) | Cited by 0
One-sentence summary

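This paper proposes a structured XAI pipeline that converts raw coding-agent execution traces into interpretable explanations, visualizations, and actionable fix recommendations. Compared with raw traces or explanations from general-purpose LLMs, it yields higher accuracy and faster understanding of failures.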

ABSTRACT

Large Language Model (LLM)-based coding agents show promise in automating software development tasks, yet they frequently fail in ways that are difficult for developers to understand and debug. While general-purpose LLMs like GPT can provide ad-hoc explanations of failures, raw execution traces remain challenging to interpret even for experienced developers. We present a systematic explainable AI (XAI) approach that transforms raw agent execution traces into structured, human-interpretable explanations. Our method consists of three key components: (1) a domain-specific failure taxonomy derived from analyzing real agent failures, (2) an automatic annotation system that classifies failures using a defined annotation schema, and (3) a hybrid explanation generator that produces visual execution flows, natural language explanations, and actionable recommendations. Through a user study with 20 participants (10 technical, 10 non-technical), we demonstrate that our approach enables users to identify failure root causes 2.8 times faster and propose correct fixes with 73% higher accuracy compared to raw execution traces. Importantly, our structured approach outperforms ad-hoc explanations from state-of-the-art models by providing consistent, domain-specific insights with integrated visualizations. Our work establishes a framework for systematic agent failure analysis, addressing the critical need for interpretable AI systems in software development workflows.

Research motivation and goals

  • Develop a domain-specific taxonomy of coding agent failures.
  • Automate the annotation of failures using a structured scheme.
  • Build a hybrid explanation system with visual, textual, and actionable outputs.
  • Empirically validate whether structured XAI outperforms raw traces and generic explanations.

Proposed method

  • Derive a failure taxonomy from 32 real-world coding agent failures across varied experimental conditions.
  • Create an automatic annotation system using GPT-4 with function-calling for structured outputs and confidence scoring.
  • Develop an integrated XAI pipeline that generates execution-flow visualizations, natural language explanations, and counterfactual/recommendation analyses.
  • Evaluate the approach through a user study (N=20) comparing against raw traces and general-purpose LLM explanations.
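
The annotation step above relies on structured outputs with confidence scores. The summary does not give the paper's actual schema, so the following is only a minimal sketch of what such a record could look like. All names here (`FailureCategory`, `FailureAnnotation`, `parse_annotation`, `needs_human_review`) and the 0.8 review threshold are hypothetical, and only the iterative-refinement category is taken from this summary; `OTHER` is a placeholder for the rest of the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    # Only this category is named in the summary.
    ITERATIVE_REFINEMENT = "iterative_refinement"
    # Placeholder standing in for the remaining taxonomy entries (assumption).
    OTHER = "other"


@dataclass(frozen=True)
class FailureAnnotation:
    """One classified failure, as a structured-output payload might encode it."""

    trace_id: str
    category: FailureCategory
    confidence: float  # expected to lie in [0, 1]
    rationale: str

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")


def parse_annotation(trace_id: str, payload: dict) -> FailureAnnotation:
    """Validate a dict of structured arguments and turn it into a record.

    Raises ValueError for an unknown category or out-of-range confidence,
    and KeyError if a required field is missing.
    """
    return FailureAnnotation(
        trace_id=trace_id,
        category=FailureCategory(payload["category"]),
        confidence=float(payload["confidence"]),
        rationale=str(payload.get("rationale", "")),
    )


def needs_human_review(annotation: FailureAnnotation, threshold: float = 0.8) -> bool:
    """Flag low-confidence annotations; the threshold value is an assumption."""
    return annotation.confidence < threshold


example = parse_annotation(
    "trace-001",
    {
        "category": "iterative_refinement",
        "confidence": 0.91,
        "rationale": "Hit the iteration limit without passing tests.",
    },
)
print(example.category.name, needs_human_review(example))
```

Validating the payload before it enters the pipeline keeps malformed model output from reaching the explanation stage, and a confidence threshold gives one simple way to separate high-confidence predictions from those that may need manual checking.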
Figure 1: System architecture showing the flow from raw trace to final explanation report. The system consists of three main components: automatic annotation, explanation generation, and report synthesis.

Experimental results

Research questions

  • RQ1: What failure patterns occur in coding agents solving HumanEval tasks?
  • RQ2: Can automated annotation accurately classify failures into a domain-specific taxonomy?
  • RQ3: Do structured XAI explanations improve understanding, accuracy of root-cause identification, and quality of fixes compared with raw traces and generic LLM explanations?
  • RQ4: How do visualizations, explanations, and recommendations affect technical and non-technical users?

Key findings

| Group | Metric | Raw | General-Purpose LLMs | Our XAI |
|---|---|---|---|---|
| Technical | Time to understand (min) | 8.4 ± 2.1 | 5.2 ± 1.3 | 3.0 ± 0.8 |
| Technical | Root cause accuracy (%) | 42 ± 15 | 68 ± 12 | 89 ± 8 |
| Technical | Fix quality (1-5) | 2.6 ± 0.8 | 3.4 ± 0.6 | 4.3 ± 0.5 |
| Technical | Confidence (1-7) | 3.2 ± 1.1 | 4.8 ± 0.9 | 6.1 ± 0.7 |
| Non-Technical | Time to understand (min) | 12.8 ± 3.2 | 7.1 ± 1.8 | 4.2 ± 1.1 |
| Non-Technical | Root cause accuracy (%) | 18 ± 12 | 52 ± 18 | 76 ± 11 |
| Non-Technical | Fix quality (1-5) | 1.4 ± 0.6 | 2.8 ± 0.7 | 3.8 ± 0.6 |
| Non-Technical | Confidence (1-7) | 2.1 ± 0.9 | 4.2 ± 1.1 | 5.6 ± 0.8 |
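
As a quick check on how the headline speedup relates to the table, the sketch below recomputes time ratios and accuracy gains from the reported means only (the ± spreads are ignored). The dictionary layout and the function names `speedup` and `point_gain` are illustrative, not from the paper. The technical-group ratio against raw traces, 8.4 / 3.0, comes out to the 2.8x figure quoted in the abstract.

```python
# Mean values copied from the results table: (raw, general-purpose LLM, our XAI).
time_to_understand_min = {
    "Technical": (8.4, 5.2, 3.0),
    "Non-Technical": (12.8, 7.1, 4.2),
}
root_cause_accuracy_pct = {
    "Technical": (42, 68, 89),
    "Non-Technical": (18, 52, 76),
}


def speedup(baseline_minutes: float, treatment_minutes: float) -> float:
    """Ratio of baseline time to treatment time (higher means faster)."""
    return baseline_minutes / treatment_minutes


def point_gain(baseline_pct: float, treatment_pct: float) -> float:
    """Absolute improvement in percentage points."""
    return treatment_pct - baseline_pct


for group, (raw, llm, xai) in time_to_understand_min.items():
    acc_raw, acc_llm, acc_xai = root_cause_accuracy_pct[group]
    print(
        f"{group}: {speedup(raw, xai):.2f}x faster than raw, "
        f"{speedup(llm, xai):.2f}x faster than general-purpose LLMs, "
        f"+{point_gain(acc_raw, acc_xai)} accuracy points over raw"
    )
```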
  • Iterative Refinement Failures dominate (56% of 32 failures); exceeding the iteration limit without making progress is the most common pattern.
  • Automatic classifier accuracy: 82.1% (26/32), rising to 90.5% on high-confidence predictions, with substantial agreement (Cohen’s kappa = 0.76).
  • Our XAI system yields faster failure comprehension (2.8x) and higher root-cause accuracy (89% technical, 76% non-technical) vs baselines.
  • Technical participants’ root-cause accuracy improved from 42% (raw) to 89% (Our XAI); non-technical improved from 18% (raw) to 76% (Our XAI).
  • Fix proposals are rated higher under Our XAI (4.3/5 technical, 3.8/5 non-technical) than baselines.
  • Users reported higher confidence with Our XAI (6.1/7 technical, 5.6/7 non-technical).
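
The classifier findings above are reported as accuracy and Cohen's kappa. For readers unfamiliar with the second metric, here is a minimal sketch of how both are computed from paired labels, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The labels in the example are toy data invented for illustration, not the paper's annotations, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def accuracy(predicted: Sequence[Hashable], reference: Sequence[Hashable]) -> float:
    """Fraction of items where the two label sequences agree."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = accuracy(labels_a, labels_b)  # p_o
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # p_e: probability that both annotators pick the same label by chance.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: 4 traces, two hypothetical categories "A" and "B".
automatic = ["A", "A", "B", "A"]
manual = ["A", "A", "B", "B"]
print(accuracy(automatic, manual))      # 0.75
print(cohens_kappa(automatic, manual))  # 0.5
```

In the toy example, raw agreement is 0.75 but chance agreement is 0.5, so kappa is 0.5. This is why kappa is usually reported alongside accuracy when category frequencies are uneven.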
Figure 2: Execution Flow


This digest was generated by AI and reviewed by a human editor.