QUICK REVIEW

[Paper Review] TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

Jiangping Huang, Wenguang Ye | arXiv (Cornell University) | Feb 6, 2026
Software Engineering Research | 0 citations
TL;DR

TraceCoder uses a three-agent loop (Instrumentation, Analysis, Repair) with a Historical Lesson Learning Mechanism (HLLM) and a Rollback Mechanism to debug and repair LLM-generated code, reporting up to a 34.43% relative improvement in Pass@1 over existing baselines.

ABSTRACT

Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.

Motivation & Objective

  • Automate the debugging of LLM-generated code so that subtle bugs, which binary pass/fail signals do not expose, can be found and fixed.
  • Introduce a trace-driven, multi-agent architecture that mimics expert debugging (observe-analyze-repair).
  • Improve fault localization and repair efficiency through runtime traces and learning from past failures.
  • Enhance reliability and convergence with a Rollback Mechanism and a Historical Lesson Learning Mechanism.

Proposed method

  • Instrumentation Agent inserts diagnostic probes into faulty code to collect fine-grained runtime traces without altering semantics.
  • Analysis Agent performs causal reasoning on runtime traces and past failures to generate a repair plan and instrumentation suggestions.
  • Repair Agent applies the proposed repair plan to modify code and participates in iterative testing.
  • Historical Lesson Learning Mechanism (HLLM) distills lessons from failed repairs to inform future cycles.
  • Rollback Mechanism preserves and reverts to the best-known state to ensure steady improvement.
  • Shared artifact-based communication mediates iterative, structured feedback among agents.
  • Evaluations compare TraceCoder against baselines on HumanEval, HumanEval+, BigCodeBench, and ClassEval using Pass@1 as the metric.
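
The repair cycle described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `count_passed`, `repair_loop`, and `stub_proposer` are hypothetical names, the number of passing tests stands in for whatever improvement criterion the paper actually uses, and a canned list of patches replaces the three LLM-backed agents.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected result)


def count_passed(source: str, tests: List[TestCase], func_name: str) -> int:
    """Execute `source` and count how many test cases `func_name` passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only: sandbox untrusted code in practice
    except Exception:
        return 0
    func = namespace.get(func_name)
    if not callable(func):
        return 0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed


def repair_loop(
    initial: str,
    tests: List[TestCase],
    func_name: str,
    propose_repair: Callable[[str, List[str]], str],
    max_iters: int = 5,
) -> Tuple[str, int, List[str]]:
    """Keep the best-known version; reject any patch that is not strictly better."""
    best_source = initial
    best_score = count_passed(initial, tests, func_name)
    lessons: List[str] = []  # assumed stand-in for the lesson history (HLLM)
    for _ in range(max_iters):
        if best_score == len(tests):
            break
        candidate = propose_repair(best_source, lessons)
        score = count_passed(candidate, tests, func_name)
        if score > best_score:
            best_source, best_score = candidate, score  # strict improvement: accept
        else:
            # Rollback: discard the patch and record why it was rejected.
            lessons.append(
                f"patch scored {score}/{len(tests)}, no better than {best_score}/{len(tests)}"
            )
    return best_source, best_score, lessons


# Hypothetical demo: a canned proposer replaces the LLM agents.
buggy = "def add(a, b):\n    return a - b\n"
patches = iter([
    "def add(a, b):\n    return b - a\n",  # not an improvement, gets rolled back
    "def add(a, b):\n    return a + b\n",  # correct fix
])


def stub_proposer(source: str, lessons: List[str]) -> str:
    return next(patches)


tests = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4)]
final_source, final_score, final_lessons = repair_loop(buggy, tests, "add", stub_proposer)
```

In this demo the first patch passes no more tests than the original, so the loop keeps the original as its best-known state and logs a lesson; the second patch passes all three tests and is accepted. A real proposer would read `lessons` to avoid repeating rejected strategies.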
Figure 1. Limitations of simple execution feedback. Without runtime insights, the model repeatedly applies local patches that degrade the code’s correctness, causing it to loop between incorrect versions rather than converging to a correct global solution.
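
To make the contrast with a bare pass/fail signal concrete, the sketch below records per-line variable snapshots while a faulty function runs. It is an assumption-laden illustration: it uses Python's standard `sys.settrace` hook rather than probes inserted into the source, which is what the Instrumentation Agent is described as doing, and `run_with_trace` and `buggy_mean` are hypothetical names.

```python
import sys
from typing import Any, Callable, Dict, List, Tuple

TraceEntry = Tuple[int, Dict[str, Any]]  # (line offset in function, locals snapshot)


def run_with_trace(func: Callable[..., Any], *args: Any) -> Tuple[Any, List[TraceEntry]]:
    """Call func(*args) and snapshot its locals just before each line executes."""
    trace: List[TraceEntry] = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore every frame except the function under test
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            trace.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, trace


def buggy_mean(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: divisor is off by one


result, trace = run_with_trace(buggy_mean, [2, 4, 6])
for offset, snapshot in trace:
    print(offset, snapshot)
```

A test on `buggy_mean([2, 4, 6])` only reports that the result differs from 4.0. The trace additionally shows `total` reaching the correct sum of 12 right before the return line, which points at the division rather than the accumulation as the fault.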

Experimental results

Research questions

  • RQ1: How effective is TraceCoder at repairing LLM-generated code compared to advanced automated repair methods?
  • RQ2: How do TraceCoder’s key hyperparameters affect repair performance and stability?
  • RQ3: What is the contribution of each core component to TraceCoder’s overall effectiveness?
  • RQ4: How does TraceCoder perform in practice regarding reliability, cost efficiency, and failure modes compared to sampling-based strategies?

Key findings

  • TraceCoder achieves up to 34.43% relative improvement in Pass@1 accuracy on challenging class-level benchmarks.
  • The ablation study shows that the iterative repair process alone contributes a 65.61% relative gain in accuracy.
  • TraceCoder outperforms leading iterative methods in both accuracy and cost efficiency.
  • The framework leverages fine-grained runtime traces, historical learning, and rollback to stabilize convergence.
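
The percentages above are relative gains, which are easy to confuse with absolute percentage-point differences. The sketch below shows how the two relate, assuming the usual definitions: Pass@1 as the share of problems whose single generated solution passes all tests, and relative improvement as the change divided by the baseline. The outcome lists are hypothetical and are not results from the paper.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Percentage of problems whose single generated solution passes all tests."""
    if not outcomes:
        raise ValueError("need at least one problem")
    return 100.0 * sum(outcomes) / len(outcomes)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical per-problem outcomes for five problems (not data from the paper).
baseline_score = pass_at_1([True, False, False, True, False])
improved_score = pass_at_1([True, True, False, True, False])

absolute_gain = improved_score - baseline_score  # percentage points
relative_gain = relative_improvement(baseline_score, improved_score)  # percent
print(baseline_score, improved_score, absolute_gain, relative_gain)
```

With these made-up outcomes, Pass@1 rises from 40% to 60%: an absolute gain of 20 percentage points but a relative gain of 50%. The same distinction applies when reading the 34.43% and 65.61% figures reported for TraceCoder.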
Figure 2. Overview of TraceCoder’s workflow. ① An LLM generates an initial code solution. ② The code is executed and tested. A multi-agent debugging loop—comprising the Instrumentation, Analysis, and Repair Agents—emulates expert debugging behaviors by leveraging runtime tracing, HLLM, and RM to ena

This review was created by AI and reviewed by human editors.