QUICK REVIEW

[Paper Review] TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

Jiangping Huang, Wenguang Ye|arXiv (Cornell University)|2026. 02. 06.

Software Engineering Research | Citations: 0

One-Line Summary

TraceCoder debugs and repairs LLM-generated code with a three-agent loop (Instrumentation, Analysis, Repair) combined with a Historical Lesson Learning Mechanism and a Rollback Mechanism, achieving substantial Pass@1 gains over baselines.

ABSTRACT

Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.

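Research Motivation and Objectives

Advance automated debugging of LLM-generated code so that it can resolve subtle bugs that binary pass/fail signals do not expose.
Introduce a trace-driven multi-agent architecture that emulates the observe-analyze-repair process of expert debugging.
Improve fault localization and repair efficiency through runtime traces and learning from past failures.
Improve reliability and convergence with the Rollback Mechanism and the Historical Lesson Learning Mechanism.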

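Proposed Method

The Instrumentation Agent inserts diagnostic probes into the faulty code to collect fine-grained runtime traces without changing its semantics.
The Analysis Agent performs causal reasoning over the runtime traces and past failures to produce a repair plan and instrumentation suggestions.
The Repair Agent applies the proposed repair plan to modify the code and takes part in iterative testing.
The Historical Lesson Learning Mechanism (HLLM) distills lessons from failed repairs to inform later cycles.
The Rollback Mechanism preserves and restores the best known state to ensure stable improvement.
Shared artifact-based communication mediates iterative, structured feedback between the agents.
The evaluation compares TraceCoder with baselines on HumanEval, HumanEval+, BigCodeBench, and ClassEval, using Pass@1 as the metric.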
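
The loop described above can be pictured with a small toy. This is only a sketch under loud assumptions, not the paper's implementation: `RepairState`, `run_tests`, `debug_loop`, and `scripted_fixer` are hypothetical names, the LLM agents are replaced by a scripted function, the Instrumentation Agent's probes are reduced to logging each test's input and output, and "improvement" is assumed to mean "passes strictly more tests".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Test = Tuple[tuple, object]  # (positional args, expected result)


@dataclass
class RepairState:
    code: str                                          # best known version
    passed: int                                        # tests that version passes
    lessons: List[str] = field(default_factory=list)   # notes from failed repairs


def run_tests(code: str, tests: List[Test], entry: str) -> Tuple[int, List[str]]:
    """Load `code`, call `entry` on every test, return (pass count, trace)."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy only: never exec untrusted code like this
    except Exception as exc:
        return 0, [f"load error: {exc!r}"]
    passed, trace = 0, []
    for args, expected in tests:
        try:
            got = namespace[entry](*args)
        except Exception as exc:
            trace.append(f"{entry}{args} raised {exc!r}")
            continue
        trace.append(f"{entry}{args} -> {got!r} (expected {expected!r})")
        if got == expected:
            passed += 1
    return passed, trace


def debug_loop(code: str, tests: List[Test], entry: str,
               propose_fix: Callable[[str, List[str], List[str]], str],
               max_iters: int = 5) -> RepairState:
    passed, trace = run_tests(code, tests, entry)
    state = RepairState(code=code, passed=passed)
    for _ in range(max_iters):
        if state.passed == len(tests):
            break
        candidate = propose_fix(state.code, trace, state.lessons)
        cand_passed, cand_trace = run_tests(candidate, tests, entry)
        if cand_passed > state.passed:
            # Strict improvement: the candidate becomes the new best state.
            state.code, state.passed, trace = candidate, cand_passed, cand_trace
        else:
            # Rollback: keep the best known state and record a lesson instead.
            state.lessons.append(
                f"patch passing {cand_passed} tests did not beat {state.passed}"
            )
    return state


def scripted_fixer(code: str, trace: List[str], lessons: List[str]) -> str:
    # Hypothetical stand-in for the Analysis and Repair Agents:
    # a wrong patch first, then a correct one once a lesson has been recorded.
    if not lessons:
        return "def add(a, b):\n    return a * b\n"
    return "def add(a, b):\n    return a + b\n"


buggy = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((2, 3), 5), ((0, 5), 5)]
final = debug_loop(buggy, tests, "add", scripted_fixer)
print(final.passed, len(final.lessons))
```

In this toy run the first patch does not improve on the buggy version, so it is discarded and a lesson is stored; the second patch passes every test and is accepted. A real system would put richer trace data and model-generated analysis where the scripted function sits.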

Figure 1. Limitations of simple execution feedback. Without runtime insights, the model repeatedly applies local patches that degrade the code’s correctness, causing it to loop between incorrect versions rather than converging to a correct global solution.

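Experimental Results

Research Questions

RQ1: How effectively does TraceCoder repair LLM-generated code compared with advanced automated repair methods?
RQ2: How do TraceCoder's key hyperparameters affect repair performance and stability?
RQ3: What does each core component contribute to TraceCoder's overall effectiveness?
RQ4: How does TraceCoder perform in practice, in terms of reliability, cost-efficiency, and failure modes, compared with sampling-based strategies?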

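Key Results

TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy on challenging class-level benchmarks.
In the ablation study, iterative repair alone yields a 65.61% relative gain in accuracy.
TraceCoder outperforms leading iterative methods in both accuracy and cost-efficiency.
The framework uses fine-grained runtime traces, historical learning, and rollback to stabilize convergence.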
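
Both headline figures are relative gains, which are easy to misread as percentage-point differences. The sketch below assumes the usual definitions, which this review does not spell out: Pass@1 as the fraction of problems whose single generated solution passes all tests, and relative improvement as (new - baseline) / baseline. The function names and the per-problem outcomes are made up for illustration and are not results from the paper.

```python
from typing import List


def pass_at_1(results: List[bool]) -> float:
    """Fraction of problems solved by the single generated solution."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def relative_improvement(baseline: float, new: float) -> float:
    """Gain relative to the baseline score, as a fraction (0.5 means 50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Illustrative outcomes for four problems (True = all tests passed).
baseline_results = [True, False, False, True]
repaired_results = [True, False, True, True]

base = pass_at_1(baseline_results)   # 0.5
new = pass_at_1(repaired_results)    # 0.75
gain = relative_improvement(base, new)
print(f"absolute: {(new - base) * 100:.0f} points, relative: {gain * 100:.0f}%")
```

Here a 25-point absolute gain corresponds to a 50% relative gain, so a relative figure always needs its baseline score to be interpreted.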

Figure 2. Overview of TraceCoder’s workflow. ① An LLM generates an initial code solution. ② The code is executed and tested. A multi-agent debugging loop—comprising the Instrumentation, Analysis, and Repair Agents—emulates expert debugging behaviors by leveraging runtime tracing, HLLM, and RM to ena


This review was generated by AI and reviewed by a human editor.