QUICK REVIEW

[Paper Review] CodeT: Code Generation with Generated Tests

Bei Chen, Fengji Zhang|arXiv (Cornell University)|Jul 21, 2022

Software Testing and Debugging Techniques | Cited by 64

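One-Sentence Summary

CodeT uses the same pre-trained language model that generates the code to automatically generate test cases, then applies dual execution agreement to select the best code solution from multiple samples, improving pass@1 across several benchmarks and models.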

ABSTRACT

The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select the most appropriate solution from the multiple samples generated by the pre-trained language models. A natural way to evaluate the quality and correctness of a code solution is to run it against a set of test cases, but the manual creation of such test cases is often costly and time-consuming. In this paper, we propose a novel method, CodeT, that leverages the same pre-trained language models to automatically generate test cases for the code samples, thus reducing the human effort and increasing the coverage of the test scenarios. CodeT then executes the code samples using the generated test cases, and performs a dual execution agreement, which considers both the consistency of the outputs against the generated test cases and the agreement of the outputs with other code samples. We conduct comprehensive experiments on four benchmarks, HumanEval, MBPP, APPS and CodeContests, using five different pre-trained language models with varying sizes and capabilities. Our results show that CodeT can significantly improve the performance of code solution selection over previous methods, achieving remarkable and consistent gains across different models and benchmarks. For instance, CodeT improves the pass@1 metric on HumanEval to 65.8%, which represents an absolute improvement of 18.8% over the code-davinci-002 model, and an absolute improvement of more than 20% over the previous state-of-the-art results.

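Motivation and Goals

Reduce reliance on hand-written test cases by generating them automatically with the same LM used for code generation.
Use the generated tests, through execution-based consensus, to improve the selection of a code solution from multiple samples.
Improve the robustness and coverage of evaluation through dual agreement: agreement with test results and agreement across solutions.
Demonstrate effectiveness in the zero-shot setting across multiple benchmarks and model families.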

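Proposed Method

Generate test cases for each programming problem by prompting the same pre-trained LM used for code generation to output input-output pairs.
Use the LM to generate a large set of code solutions X from the problem context, with no labeled data required.
Apply RANSAC-inspired dual execution agreement to find consensus sets of (code, test) pairs in which the solutions pass the same test cases and agree with one another.
Rank consensus sets by f(S) = |Sx| * |Sy|, where Sx is the set of solutions and Sy the set of test cases in consensus set S, and select the best code solution from the top-ranked set.
Optionally deduplicate solutions and compare performance with and without deduplication (the ablation shows little impact).
Evaluate pass@k in a zero-shot setting across multiple benchmarks and LM families, selecting solutions with the generated test cases rather than ground-truth labeled data.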
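
The grouping and ranking steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: solutions are plain Python callables rather than sandboxed programs, test cases are (input, expected output) pairs, and the names `passes`, `rank_consensus_sets`, `SOLUTIONS`, and `TESTS` are hypothetical. Solutions that pass exactly the same subset of generated tests form one consensus set, which is scored as |Sx| * |Sy|.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Solution = Callable[[int], int]   # assumption: one-argument integer functions
TestCase = Tuple[int, int]        # assumption: (input, expected output)


def passes(solution: Solution, test: TestCase) -> bool:
    """Return True if the solution produces the expected output for the test."""
    arg, expected = test
    try:
        return solution(arg) == expected
    except Exception:
        # A crashing solution simply fails the test.
        return False


def rank_consensus_sets(
    solutions: List[Solution], tests: List[TestCase]
) -> List[Tuple[int, List[int], List[int]]]:
    """Group solutions by the exact set of tests they pass, then rank groups.

    Returns (score, solution indices, test indices) tuples, highest score first,
    where score = |Sx| * |Sy|.
    """
    groups = defaultdict(list)
    for i, solution in enumerate(solutions):
        passed = frozenset(j for j, t in enumerate(tests) if passes(solution, t))
        groups[passed].append(i)

    ranked = [(len(sx) * len(sy), sx, sorted(sy)) for sy, sx in groups.items()]
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Toy problem: "square a number". Two candidates are correct, two are not.
SOLUTIONS: List[Solution] = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * 2,
    lambda x: x + 1,
]
# Generated tests; the last one (1 -> 2) is deliberately wrong.
TESTS: List[TestCase] = [(2, 4), (3, 9), (0, 0), (1, 2)]

ranked = rank_consensus_sets(SOLUTIONS, TESTS)
best_score, best_solutions, best_tests = ranked[0]
print(best_score, best_solutions, best_tests)
```

In this toy run the two correct solutions land in the same consensus set and outrank the incorrect ones even though one generated test is wrong, which is the intuition behind scoring by both the number of agreeing solutions and the number of tests passed. Real model-generated code should be executed in an isolated sandbox rather than called directly.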

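Experimental Results

Research Questions

RQ1: How much do LM-generated test cases, in terms of quality and coverage, help drive code selection?
RQ2: Does dual execution agreement improve the selection of correct solutions across models and benchmarks?
RQ3: How does CodeT perform in the zero-shot setting across diverse benchmarks and model sizes?
RQ4: How sensitive is CodeT to the number of generated test cases and to their quality (toxicity, accuracy, coverage)?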

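Key Findings

CodeT substantially improves pass@1 across benchmarks and models; for example, on HumanEval with code-davinci-002, the baseline is 47.0% and CodeT reaches 65.8%.
On MBPP, pass@1 for code-davinci-002 rises from 58.1% to 67.7%.
On APPS Introductory, pass@1 rises from 27.2% to 34.6%.
On CodeContests, pass@1 rises from 0.7% to 2.1% (zero-shot).
CodeT yields consistent gains across the Codex, InCoder, and CodeGen families and outperforms AlphaCode-C in all reported settings.
Test case quality (accuracy, toxicity, coverage) correlates with CodeT's gains; higher-quality test cases (for example, those from code-davinci-002) bring larger improvements.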
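
To make the size of these gains easier to compare, the short script below computes the absolute pass@1 improvement in percentage points for each benchmark. The figures are copied from the findings listed here; the names `REPORTED` and `absolute_gains` are hypothetical and exist only for this illustration.

```python
# pass@1 in percent, as listed in the findings: (baseline, with CodeT)
REPORTED = {
    "HumanEval": (47.0, 65.8),
    "MBPP": (58.1, 67.7),
    "APPS Introductory": (27.2, 34.6),
    "CodeContests": (0.7, 2.1),
}


def absolute_gains(results):
    """Return the absolute pass@1 gain (percentage points) per benchmark."""
    # Round to one decimal place to avoid floating-point noise.
    return {name: round(codet - base, 1) for name, (base, codet) in results.items()}


GAINS = absolute_gains(REPORTED)
for name, gain in GAINS.items():
    print(f"{name}: +{gain} points")
```

The HumanEval gain computed this way matches the 18.8% absolute improvement stated in the abstract.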


This review was generated by AI and reviewed by a human editor.