QUICK REVIEW

[Paper Digest] XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights

Arun Joshi|arXiv (Cornell University)|Mar 6, 2026

Explainable Artificial Intelligence (XAI)|Cited by 0

One-Sentence Summary

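This paper proposes a structured XAI pipeline that converts raw coding-agent execution traces into interpretable explanations, visualizations, and actionable fix recommendations. Compared with raw traces or generic LLM explanations, it yields higher accuracy and faster understanding of failures.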

ABSTRACT

Large Language Model (LLM)-based coding agents show promise in automating software development tasks, yet they frequently fail in ways that are difficult for developers to understand and debug. While general-purpose LLMs like GPT can provide ad-hoc explanations of failures, raw execution traces remain challenging to interpret even for experienced developers. We present a systematic explainable AI (XAI) approach that transforms raw agent execution traces into structured, human-interpretable explanations. Our method consists of three key components: (1) a domain-specific failure taxonomy derived from analyzing real agent failures, (2) an automatic annotation system that classifies failures using a defined annotation schema, and (3) a hybrid explanation generator that produces visual execution flows, natural language explanations, and actionable recommendations. Through a user study with 20 participants (10 technical, 10 non-technical), we demonstrate that our approach enables users to identify failure root causes 2.8 times faster and propose correct fixes with 73% higher accuracy compared to raw execution traces. Importantly, our structured approach outperforms ad-hoc explanations from state-of-the-art models by providing consistent, domain-specific insights with integrated visualizations. Our work establishes a framework for systematic agent failure analysis, addressing the critical need for interpretable AI systems in software development workflows.

Research Motivation and Goals

Develop a domain-specific taxonomy of coding agent failures.
Automate the annotation of failures using a structured scheme.
Build a hybrid explanation system with visual, textual, and actionable outputs.
Empirically validate whether structured XAI outperforms raw traces and generic explanations.

Proposed Method

Derive a failure taxonomy from 32 real-world coding agent failures across varied experimental conditions.
Create an automatic annotation system using GPT-4 with function-calling for structured outputs and confidence scoring.
Develop an integrated XAI pipeline that generates execution-flow visualizations, natural language explanations, and counterfactual/recommendation analyses.
Evaluate the approach through a user study (N=20) comparing against raw traces and general-purpose LLM explanations.
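
The annotation step above (structured outputs plus a confidence score) can be pictured with a small sketch. This is not the authors' schema: the field names (`category`, `confidence`, `rationale`), the JSON shape, the 0.8 review threshold, and the helper names are all assumptions made for illustration. Only the "Iterative Refinement" category is named in this digest, so the second category is a placeholder. The sketch validates one structured annotation, as a function-calling model might return it, and flags low-confidence labels for manual review.

```python
import json
from dataclasses import dataclass

# Hypothetical taxonomy: only "iterative_refinement" is named in this digest;
# "other" is a placeholder for the categories the digest does not list.
TAXONOMY = {"iterative_refinement", "other"}


@dataclass(frozen=True)
class FailureAnnotation:
    category: str      # one of TAXONOMY
    confidence: float  # model-reported confidence in [0, 1] (assumed scale)
    rationale: str     # short free-text justification


def parse_annotation(raw_json: str) -> FailureAnnotation:
    """Validate a JSON annotation and convert it to a typed record."""
    data = json.loads(raw_json)
    category = data["category"]
    confidence = float(data["confidence"])
    if category not in TAXONOMY:
        raise ValueError(f"unknown failure category: {category!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return FailureAnnotation(category, confidence, str(data.get("rationale", "")))


def needs_human_review(annotation: FailureAnnotation, threshold: float = 0.8) -> bool:
    """Flag annotations whose confidence falls below an assumed threshold."""
    return annotation.confidence < threshold


example = json.dumps({
    "category": "iterative_refinement",
    "confidence": 0.92,
    "rationale": "Agent hit the iteration limit without making progress.",
})
annotation = parse_annotation(example)
print(annotation.category, needs_human_review(annotation))
```

Keeping the validation separate from the model call means a malformed or out-of-taxonomy label fails loudly instead of silently entering the report.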

Figure 1: System architecture showing the flow from raw trace to final explanation report. The system consists of three main components: automatic annotation, explanation generation, and report synthesis.

Experimental Results

Research Questions

RQ1: What failure patterns occur in coding agents solving HumanEval tasks?
RQ2: Can automated annotation accurately classify failures into a domain-specific taxonomy?
RQ3: Do structured XAI explanations improve understanding, accuracy of root-cause identification, and quality of fixes compared with raw traces and generic LLM explanations?
RQ4: How do visualizations, explanations, and recommendations affect technical and non-technical users?

Key Findings

Group	Metric	Raw traces	General-purpose LLMs	Our XAI
Technical	Time to understand (min)	8.4±2.1	5.2±1.3	3.0±0.8
Technical	Root cause accuracy (%)	42±15	68±12	89±8
Technical	Fix quality (1-5)	2.6±0.8	3.4±0.6	4.3±0.5
Technical	Confidence (1-7)	3.2±1.1	4.8±0.9	6.1±0.7
Non-Technical	Time to understand (min)	12.8±3.2	7.1±1.8	4.2±1.1
Non-Technical	Root cause accuracy (%)	18±12	52±18	76±11
Non-Technical	Fix quality (1-5)	1.4±0.6	2.8±0.7	3.8±0.6
Non-Technical	Confidence (1-7)	2.1±0.9	4.2±1.1	5.6±0.8
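
The headline ratios can be re-derived from the table's mean values. The sketch below copies the time and root-cause accuracy means from the table and computes speedups and percentage-point gains. The dictionary layout and function names are illustrative choices, not part of the paper. With these means, the technical group reproduces the 2.8x speedup over raw traces, while the non-technical group comes out slightly higher at about 3.05x.

```python
# Mean values copied from the table: (raw traces, general-purpose LLMs, Our XAI).
RESULTS = {
    "Technical": {
        "time_min": (8.4, 5.2, 3.0),
        "root_cause_acc_pct": (42, 68, 89),
    },
    "Non-Technical": {
        "time_min": (12.8, 7.1, 4.2),
        "root_cause_acc_pct": (18, 52, 76),
    },
}


def speedup(baseline_time: float, xai_time: float) -> float:
    """How many times faster the XAI condition is, rounded to 2 decimals."""
    return round(baseline_time / xai_time, 2)


def summarize(results: dict) -> dict:
    """Compute speedups and accuracy gains (percentage points) per group."""
    summary = {}
    for group, metrics in results.items():
        raw_t, llm_t, xai_t = metrics["time_min"]
        raw_a, llm_a, xai_a = metrics["root_cause_acc_pct"]
        summary[group] = {
            "speedup_vs_raw": speedup(raw_t, xai_t),
            "speedup_vs_llm": speedup(llm_t, xai_t),
            "acc_gain_vs_raw_pts": xai_a - raw_a,
            "acc_gain_vs_llm_pts": xai_a - llm_a,
        }
    return summary


SUMMARY = summarize(RESULTS)
for group, row in SUMMARY.items():
    print(group, row)
```

These are ratios of group means only. They ignore the reported standard deviations, so they say nothing about statistical significance.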

Iterative Refinement Failures dominate (56% of 32 failures); exceeding the iteration limit without making progress is the most common pattern.
Automatic classifier accuracy: 82.1% (26/32) with higher accuracy (90.5%) on high-confidence predictions and substantial agreement (Cohen’s kappa = 0.76).
Our XAI system yields faster failure comprehension (2.8x relative to raw traces) and higher root-cause accuracy (89% technical, 76% non-technical) than either baseline.
Technical participants’ root-cause accuracy improved from 42% (raw) to 89% (Our XAI); non-technical improved from 18% (raw) to 76% (Our XAI).
Fix proposals are rated higher under Our XAI (4.3/5 technical, 3.8/5 non-technical) than baselines.
Users reported higher confidence with Our XAI (6.1/7 technical, 5.6/7 non-technical).
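
For readers unfamiliar with the classifier metrics above, here is a minimal sketch of how Cohen's kappa and accuracy on high-confidence predictions are typically computed. Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. The toy labels, the 0.8 confidence cutoff, and the function names are made up for illustration and are not the paper's data or code.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal frequencies per label.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def accuracy_above(predictions: list, threshold: float):
    """Accuracy over (predicted, true, confidence) triples with confidence >= threshold."""
    kept = [(p, t) for p, t, conf in predictions if conf >= threshold]
    if not kept:
        return None
    return sum(p == t for p, t in kept) / len(kept)


# Toy data, invented for illustration only.
human = ["iterative", "iterative", "other", "other"]
model = ["iterative", "iterative", "other", "iterative"]
toy_kappa = cohen_kappa(human, model)

toy_predictions = [("a", "a", 0.9), ("a", "b", 0.4), ("b", "b", 0.95)]
toy_high_conf_accuracy = accuracy_above(toy_predictions, 0.8)
print(toy_kappa, toy_high_conf_accuracy)
```

On the commonly used Landis and Koch scale, kappa values between 0.61 and 0.80 are labeled "substantial", which matches the wording used for the reported 0.76.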

This digest was generated by AI and reviewed by a human editor.