QUICK REVIEW

[Paper Review] An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Max Schäfer, Sarah Nadi|arXiv (Cornell University)|Feb 13, 2023

Software Testing and Debugging Techniques | Cited by 53

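One-Sentence Summary

The paper presents TestPilot, an adaptive LLM-based tool that generates JavaScript unit tests without additional training; on 25 npm packages it achieves high coverage and produces diverse tests that are not copies of existing ones. The paper also compares TestPilot with Nessie and examines the effect of different prompt components and LLMs.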

ABSTRACT

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.

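Motivation and Goals

Automate unit test generation to reduce developer effort.
Evaluate whether off-the-shelf LLMs can generate effective unit tests without fine-tuning.
Measure the coverage and quality of LLM-generated tests (assertions and non-trivial assertions).
Analyze how individual prompt components affect test generation.
Compare TestPilot against an existing test generation technique and across multiple LLMs.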

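Proposed Method

Prompt-based test generation with gpt3.5-turbo, using prompts that contain the function's signature, implementation, documentation, and usage examples.
Adaptive re-prompting: if a generated test fails, the model is re-prompted with the failing test and its error message to repair it.
A five-component TestPilot architecture: API Explorer, Documentation Miner, Prompt Generator, Test Validator, and Prompt Refiner.
Dynamic discovery of a package's JavaScript API by inspecting the package at runtime to identify testable functions.
Mocha-based test generation and execution, used to validate and refine the generated tests.
Comparison experiments against Nessie and against alternative LLMs (code-cushman-002 and StarCoder).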

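Experimental Results

Research Questions

RQ1: What statement coverage and branch coverage do the tests generated by TestPilot achieve?
RQ2: How does including or removing individual information components (function body, usage examples, doc comments) affect the effectiveness of TestPilot's prompts?
RQ3: How does TestPilot perform with different LLMs (gpt3.5-turbo, code-cushman-002, StarCoder)?
RQ4: How similar are the generated tests to existing tests (that is, are they memorized or copied from training data)?
RQ5: Do the generated tests contain non-trivial assertions that exercise the package's functionality?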

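Key Findings

Across 25 npm packages, TestPilot's tests reach a median statement coverage of 70.2% and a median branch coverage of 52.8%.
Nessie, by comparison, reaches 51.3% statement coverage and 25.6% branch coverage.
92.8% of TestPilot's tests have at most 50% similarity with existing tests, and none is an exact copy.
60.0% of the tests have at most 40% similarity with existing tests.
Adaptive re-prompting repairs about 15.6% of failing tests.
code-cushman-002 gives similar results (68.2% statement, 51.2% branch coverage), while StarCoder is somewhat worse (54.0% statement, 37.5% branch coverage).
All five prompt components matter for high-quality test generation; removing any one of them reduces effectiveness.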


This review was generated by AI and reviewed by a human editor.