QUICK REVIEW

[Paper Review] Scaling Test-Driven Code Generation from Functions to Classes: An Empirical Study

Yunhao Liang, Ruixuan Ying|arXiv (Cornell University)|Feb 3, 2026

Software Engineering Research | Cited by 0

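One-Sentence Summary

This study extends test-driven code generation from function-level tasks to class-level synthesis using a dependency-aware TDD framework, introduces ClassEval-TDD, and demonstrates substantial gains in class-level correctness across eight large language models.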

ABSTRACT

Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to function-level tasks, leaving class-level synthesis where multiple methods interact through shared state and call dependencies underexplored. In this paper, we scale test-driven code generation from functions to classes via an iterative TDD framework. Our approach first analyzes intra-class method dependencies to derive a feasible generation schedule, and then incrementally implements each method under method-level public tests with reflection-style execution feedback and bounded repair iterations. To support test-driven generation and rigorous class-level evaluation, we construct ClassEval-TDD, a cleaned and standardized variant of ClassEval with consistent specifications, deterministic test environments, and complete method-level public tests. We conduct an empirical study across eight LLMs and compare against the strongest direct-generation baseline (the best of holistic, incremental, and compositional strategies). Our class-level TDD framework consistently improves class-level correctness by 12 to 26 absolute points and achieves up to 71% fully correct classes, while requiring only a small number of repairs on average. These results demonstrate that test-driven generation can effectively scale beyond isolated functions and substantially improve class-level code generation reliability. All code and data are available at https://anonymous.4open.science/r/ClassEval-TDD-C4C9/

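Motivation and Objectives

Scale TDD from function-level code to class-level code, where methods share state and depend on one another.
Introduce dependency-aware scheduling to determine a feasible generation order for the methods within a class.
Develop a reflection-style repair mechanism that fixes TDD-generated methods within a bounded budget.
Create ClassEval-TDD as a cleaned, deterministic benchmark for reliable class-level TDD evaluation.
Run an empirical evaluation across multiple LLMs and compare against strong baselines to quantify improvements and failure modes.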

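Proposed Method

Analyze intra-class method dependencies to derive a generation schedule that respects prerequisite relationships.
Implement each method incrementally in schedule order, using method-level public tests as executable specifications.
Use a reflection-style repair loop to diagnose failures, propose minimal patches, and keep patching within the repair budget until the tests pass.
Build ClassEval-TDD by repairing and standardizing ClassEval, ensuring corrected docstrings, aligned skeletons, deterministic tests, and consistent method-level public tests.
Evaluate eight LLMs against the Holistic, Incremental, and Compositional baselines, measuring class-level and function-level success rates as well as dependency accuracy.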
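
The scheduling and bounded-repair steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `build_schedule`, `tdd_generate_class`, `generate`, and `run_tests` are hypothetical, the dependency map is assumed to be given as a dictionary from each method to the methods it relies on, and the LLM call and test runner are passed in as plain callables. The default budget of 3 repairs per method follows the bound reported in this review.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical callable types, introduced only for this illustration.
# generate(method, implemented_so_far, feedback) -> source code for `method`
GenerateFn = Callable[[str, Dict[str, str], Optional[str]], str]
# run_tests(method, code, implemented_so_far) -> (passed, failure feedback)
RunTestsFn = Callable[[str, str, Dict[str, str]], Tuple[bool, Optional[str]]]


def build_schedule(deps: Dict[str, Set[str]]) -> List[str]:
    """Order methods so that each one comes after the methods it depends on.

    `deps` maps a method name to the set of methods it calls or relies on.
    graphlib raises CycleError if the dependencies are cyclic.
    """
    return list(TopologicalSorter(deps).static_order())


def tdd_generate_class(
    deps: Dict[str, Set[str]],
    generate: GenerateFn,
    run_tests: RunTestsFn,
    max_repairs: int = 3,
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Implement methods one by one, repairing each at most `max_repairs` times."""
    implemented: Dict[str, str] = {}
    repairs_used: Dict[str, int] = {}

    for method in build_schedule(deps):
        # First attempt: no execution feedback is available yet.
        code = generate(method, implemented, None)
        passed, feedback = run_tests(method, code, implemented)

        repairs = 0
        while not passed and repairs < max_repairs:
            # Repair attempt: feed the test failure back to the generator.
            repairs += 1
            code = generate(method, implemented, feedback)
            passed, feedback = run_tests(method, code, implemented)

        # Keep the last attempt even if the budget ran out without passing.
        implemented[method] = code
        repairs_used[method] = repairs

    return implemented, repairs_used
```

Passing the generator and test runner as callables keeps the sketch independent of any particular model API or test harness; in practice `run_tests` would execute the method-level public tests against the partially built class.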

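Experimental Results

Research Questions

RQ1: How does repairing and standardizing ClassEval into ClassEval-TDD affect baseline generation performance across strategies and models?
RQ2: How accurately can LLMs infer intra-class method dependencies and produce feasible generation schedules?
RQ3: How effective is the dependency-aware class-level TDD framework at improving class-level correctness and reducing repairs?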

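Key Findings

Compared with ClassEval, ClassEval-TDD consistently raises class-level and function-level success rates across all models and strategies.
Incremental generation becomes markedly more competitive on ClassEval-TDD, often approaching or surpassing Holistic.
LLM dependency inference shows high recall and strong F1, with occasional topological-order violations, indicating that scheduling is a separate failure mode.
Most dependency errors stem from over-approximation (extra_deps) on No-deps methods or from missing dependencies (missing_deps) on With-deps methods, highlighting distinct bottlenecks.
Topological-order violations are concentrated in a few tasks whose semantic priors conflict with the actual dependencies.
Overall, under a bounded repair budget of at most 3 repairs per method, the framework reaches up to 71% fully correct classes.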
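
To make the dependency-related findings concrete, the following Python sketch shows one plausible way to score inferred dependencies against a reference and to detect topological-order violations. The paper's exact metric definitions are not given in this review, so the edge-level precision, recall, and F1 below, along with the function names `dependency_scores` and `order_violations`, are assumptions for illustration; only the labels `extra_deps` and `missing_deps` come from the text.

```python
from typing import Dict, List, Set, Tuple


def dependency_scores(
    predicted: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Compare predicted and reference dependencies edge by edge.

    Both arguments map a method name to the set of methods it depends on.
    """
    true_pos = extra_deps = missing_deps = 0
    for method, true_deps in gold.items():
        pred_deps = predicted.get(method, set())
        true_pos += len(pred_deps & true_deps)
        extra_deps += len(pred_deps - true_deps)    # over-approximation
        missing_deps += len(true_deps - pred_deps)  # dependencies not found

    # Assumed convention: an empty denominator counts as a perfect score.
    precision = true_pos / (true_pos + extra_deps) if true_pos + extra_deps else 1.0
    recall = true_pos / (true_pos + missing_deps) if true_pos + missing_deps else 1.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "extra_deps": float(extra_deps),
        "missing_deps": float(missing_deps),
    }


def order_violations(
    schedule: List[str],
    gold: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """List (dependency, method) pairs where the dependency is scheduled too late.

    Assumes every method named in `gold` appears in `schedule`.
    """
    position = {method: index for index, method in enumerate(schedule)}
    return [
        (dep, method)
        for method, deps in gold.items()
        for dep in deps
        if position[dep] > position[method]
    ]
```

Under this reading, a schedule can violate the reference order even when most edges are inferred correctly, which is consistent with treating scheduling as its own failure mode.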

This review was generated by AI and checked by a human editor.