[Paper Explained] XAI for Coding Agent Failures: Transforming Raw Execution Traces into Actionable Insights
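This paper proposes a structured XAI pipeline that converts raw coding-agent execution traces into interpretable explanations, visualizations, and actionable fix recommendations. Compared with raw traces or explanations from general-purpose LLMs, it yields higher accuracy and faster understanding of failures.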
Large Language Model (LLM)-based coding agents show promise in automating software development tasks, yet they frequently fail in ways that are difficult for developers to understand and debug. While general-purpose LLMs like GPT can provide ad-hoc explanations of failures, raw execution traces remain challenging to interpret even for experienced developers. We present a systematic explainable AI (XAI) approach that transforms raw agent execution traces into structured, human-interpretable explanations. Our method consists of three key components: (1) a domain-specific failure taxonomy derived from analyzing real agent failures, (2) an automatic annotation system that classifies failures using a defined annotation schema, and (3) a hybrid explanation generator that produces visual execution flows, natural language explanations, and actionable recommendations. Through a user study with 20 participants (10 technical, 10 non-technical), we demonstrate that our approach enables users to identify failure root causes 2.8 times faster and propose correct fixes with 73% higher accuracy compared to raw execution traces. Importantly, our structured approach outperforms ad-hoc explanations from state-of-the-art models by providing consistent, domain-specific insights with integrated visualizations. Our work establishes a framework for systematic agent failure analysis, addressing the critical need for interpretable AI systems in software development workflows.
Research Motivation and Goals
- Develop a domain-specific taxonomy of coding agent failures.
- Automate the annotation of failures using a structured scheme.
- Build a hybrid explanation system with visual, textual, and actionable outputs.
- Empirically validate whether structured XAI outperforms raw traces and generic explanations.
Proposed Method
- Derive a failure taxonomy from 32 real-world coding agent failures across varied experimental conditions.
- Create an automatic annotation system using GPT-4 with function-calling for structured outputs and confidence scoring.
- Develop an integrated XAI pipeline that generates execution-flow visualizations, natural language explanations, and counterfactual/recommendation analyses.
- Evaluate the approach through a user study (N=20) comparing against raw traces and general-purpose LLM explanations.
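
The annotation step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the schema layout, the function name `annotate_failure`, the field names, every category label other than `iterative_refinement`, and the 0.8 confidence threshold are hypothetical. No model is called here; the code only validates a JSON payload shaped like the arguments a function-calling model might return.

```python
import json

# Hypothetical taxonomy labels. Only "iterative_refinement" is named in this
# summary; the other two entries are placeholders for illustration.
CATEGORIES = ["iterative_refinement", "placeholder_category_a", "placeholder_category_b"]

# JSON-Schema-style description of the structured output, in the general shape
# used by function-calling interfaces. Name and fields are assumptions.
ANNOTATE_FAILURE_SCHEMA = {
    "name": "annotate_failure",
    "parameters": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": CATEGORIES},
            "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
            "rationale": {"type": "string"},
        },
        "required": ["category", "confidence", "rationale"],
    },
}

HIGH_CONFIDENCE = 0.8  # assumed cut-off for "high-confidence" predictions


def parse_annotation(raw_arguments: str) -> dict:
    """Validate a JSON argument string against the schema above."""
    data = json.loads(raw_arguments)
    required = ANNOTATE_FAILURE_SCHEMA["parameters"]["required"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return {
        "category": data["category"],
        "confidence": confidence,
        "rationale": str(data["rationale"]),
        # Low-confidence annotations are flagged for manual review.
        "needs_review": confidence < HIGH_CONFIDENCE,
    }


if __name__ == "__main__":
    example = json.dumps({
        "category": "iterative_refinement",
        "confidence": 0.65,
        "rationale": "Hit the iteration limit without passing tests.",
    })
    print(parse_annotation(example))
```

Restricting `category` to an enumerated list keeps the model's output inside the taxonomy, and carrying a confidence score alongside the label makes it possible to route uncertain cases to a human annotator.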

Experimental Results
Research Questions
- RQ1: What failure patterns occur in coding agents solving HumanEval tasks?
- RQ2: Can automated annotation accurately classify failures into a domain-specific taxonomy?
- RQ3: Do structured XAI explanations improve understanding, accuracy of root-cause identification, and quality of fixes compared with raw traces and generic LLM explanations?
- RQ4: How do visualizations, explanations, and recommendations affect technical and non-technical users?
Key Findings
| Group | Metric | Raw | General Purpose LLMs | Our XAI |
|---|---|---|---|---|
| Technical | Time to understand (min) | 8.4±2.1 | 5.2±1.3 | 3.0±0.8 |
| Technical | Root cause accuracy (%) | 42±15 | 68±12 | 89±8 |
| Technical | Fix quality (1-5) | 2.6±0.8 | 3.4±0.6 | 4.3±0.5 |
| Technical | Confidence (1-7) | 3.2±1.1 | 4.8±0.9 | 6.1±0.7 |
| Non-Technical | Time to understand (min) | 12.8±3.2 | 7.1±1.8 | 4.2±1.1 |
| Non-Technical | Root cause accuracy (%) | 18±12 | 52±18 | 76±11 |
| Non-Technical | Fix quality (1-5) | 1.4±0.6 | 2.8±0.7 | 3.8±0.6 |
| Non-Technical | Confidence (1-7) | 2.1±0.9 | 4.2±1.1 | 5.6±0.8 |
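
The speed-up implied by the time-to-understand rows can be checked with a few lines of arithmetic. The sketch below uses only the mean values from the table above; the dictionary layout and the helper name `speedup` are illustrative. These are ratios of group means, so they say nothing about per-participant variation.

```python
# Mean time-to-understand values (minutes) copied from the table above.
TIME_TO_UNDERSTAND = {
    "Technical": {"Raw": 8.4, "General Purpose LLMs": 5.2, "Our XAI": 3.0},
    "Non-Technical": {"Raw": 12.8, "General Purpose LLMs": 7.1, "Our XAI": 4.2},
}


def speedup(group: str, baseline: str, system: str = "Our XAI") -> float:
    """Ratio of the baseline mean time to the system mean time."""
    times = TIME_TO_UNDERSTAND[group]
    return times[baseline] / times[system]


if __name__ == "__main__":
    for group in TIME_TO_UNDERSTAND:
        for baseline in ("Raw", "General Purpose LLMs"):
            ratio = speedup(group, baseline)
            print(f"{group:13s} vs {baseline:20s}: {ratio:.2f}x")
```

For technical participants the ratio against raw traces is 8.4 / 3.0 = 2.8, which matches the headline figure; for non-technical participants it is 12.8 / 4.2, roughly 3.05.
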
- Iterative Refinement Failures dominate (56% of 32 failures); exceeding the iteration limit without making progress is the most common pattern.
- Automatic classifier accuracy is 82.1% (26/32), rising to 90.5% on high-confidence predictions, with substantial agreement (Cohen’s kappa = 0.76).
- Our XAI system yields faster failure comprehension (2.8x for technical participants, 8.4 to 3.0 min versus raw traces) and higher root-cause accuracy (89% technical, 76% non-technical) than both baselines.
- Technical participants’ root-cause accuracy improved from 42% (raw) to 89% (Our XAI); non-technical improved from 18% (raw) to 76% (Our XAI).
- Fix proposals are rated higher under Our XAI (4.3/5 technical, 3.8/5 non-technical) than baselines.
- Users reported higher confidence with Our XAI (6.1/7 technical, 5.6/7 non-technical).
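
The agreement statistic quoted in the findings, Cohen's kappa, corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the chance agreement computed from each rater's label frequencies. The sketch below is a plain-Python illustration; the label sequences are synthetic and are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Synthetic example: a human annotator vs. an automatic classifier.
    human = ["A", "A", "B", "B", "C", "C", "A", "B"]
    model = ["A", "A", "B", "C", "C", "C", "A", "A"]
    print(f"kappa = {cohens_kappa(human, model):.3f}")
```

In this toy case the raters agree on 6 of 8 items (p_o = 0.75) while chance agreement is 21/64, so kappa is noticeably lower than raw accuracy. The same gap explains why the summary reports both accuracy and kappa.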

This summary was generated by AI and reviewed by a human editor.