QUICK REVIEW

[Paper Review] CoverUp: Effective High Coverage Test Generation for Python

Juan Altmayer Pizzorno, Emery D. Berger|arXiv (Cornell University)|Mar 24, 2024

Software Testing and Debugging Techniques|Cited by 6

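One-sentence summary

CoverUp combines coverage analysis with LLM-based test generation to produce high-coverage Python regression tests, iteratively refining its prompts based on measured coverage.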

ABSTRACT

Testing is an essential part of software development. Test generation tools attempt to automate the otherwise labor-intensive task of test creation, but generating high-coverage tests remains challenging. This paper proposes CoverUp, a novel approach to driving the generation of high-coverage Python regression tests. CoverUp combines coverage analysis, code context, and feedback in prompts that iteratively guide the LLM to generate tests that improve line and branch coverage. We evaluate our prototype CoverUp implementation across a benchmark of challenging code derived from open-source Python projects and show that CoverUp substantially improves on the state of the art. Compared to CodaMosa, a hybrid search/LLM-based test generator, CoverUp achieves a per-module median line+branch coverage of 80% (vs. 47%). Compared to MuTAP, a mutation- and LLM-based test generator, CoverUp achieves an overall line+branch coverage of 89% (vs. 77%). We also demonstrate that CoverUp's performance stems not only from the LLM used but from the combined effectiveness of its components.

Motivation and objectives

Aim to increase regression test coverage for Python programs beyond prior methods.
Develop a feedback loop where coverage analysis guides LLM-based test generation.
Evaluate CoverUp against the previous state of the art (CodaMosa) and analyze the value of iterative dialog.
Address practical challenges in LLM-based test generation, including integration checks and flaky tests.

Proposed method

Measure current test coverage with SlipCover and identify code segments lacking coverage.
Segment code into concise excerpts containing missing coverage and provide context for LLM prompts.
Prompt an LLM (GPT-4 Turbo) to generate tests for each segment that lacks coverage, highlighting the uncovered lines and branches.
Execute generated tests, measure coverage, and continue dialog to fix issues or improve coverage.
Perform integration checks by running the full suite and disable failing tests or isolate causes when needed.
Handle practical issues such as missing modules, flaky tests, and asynchronous prompting to speed up generation.
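
The steps above can be sketched as a small feedback loop. This is a minimal illustration, not CoverUp's actual implementation: the names `Segment`, `Outcome`, `find_segments`, and `improve`, the prompt wording, and the `ask_llm` and `run_test` callables are all hypothetical stand-ins. In a real setup, `ask_llm` would call a chat model and `run_test` would execute the test under a coverage tool such as SlipCover; here they are left as parameters so the sketch stays self-contained.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Segment:
    name: str           # name of the function or class
    source: str         # code excerpt that would be shown to the LLM
    missing: Set[int]   # uncovered line numbers inside the excerpt


@dataclass
class Outcome:
    passed: bool        # did the generated test pass?
    covered: Set[int]   # lines the generated test executed
    error: str = ""     # error output when the test failed


def find_segments(source: str, missing_lines: Set[int]) -> List[Segment]:
    """Group uncovered lines by the top-level function or class containing them."""
    segments = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            span = set(range(node.lineno, node.end_lineno + 1))
            lacking = span & missing_lines
            if lacking:
                excerpt = ast.get_source_segment(source, node) or ""
                segments.append(Segment(node.name, excerpt, lacking))
    return segments


def improve(
    seg: Segment,
    ask_llm: Callable[[List[str]], str],   # hypothetical: dialog so far -> test code
    run_test: Callable[[str], Outcome],    # hypothetical: test code -> measured outcome
    max_prompts: int = 3,
) -> Optional[str]:
    """Continue the dialog until a test passes and covers at least one missing line."""
    lines = ", ".join(map(str, sorted(seg.missing)))
    messages = [f"Write pytest tests so that lines {lines} execute:\n{seg.source}"]
    for _ in range(max_prompts):
        test_code = ask_llm(messages)
        outcome = run_test(test_code)
        gained = outcome.covered & seg.missing
        if outcome.passed and gained:
            seg.missing -= gained          # record the coverage that was gained
            return test_code
        # Feed the result back so the next prompt can fix the problem.
        messages.append(test_code)
        if not outcome.passed:
            messages.append(f"The test failed with:\n{outcome.error}\nPlease fix it.")
        else:
            messages.append(f"The test passes, but lines {lines} still do not execute.")
    return None
```

Keeping the whole dialog in `messages` is the design choice that matters here: each follow-up prompt carries the earlier attempt plus the observed error or coverage gap, which is what makes the refinement iterative rather than a series of independent retries.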

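Experimental results

Research questions

RQ1: How does CoverUp's coverage compare with the previous state of the art (CodaMosa (codex) and CodaMosa (gpt4))?
RQ2: Using a state-of-the-art LLM (GPT-4), how does CoverUp's coverage compare with CodaMosa's?
RQ3: How effective is CoverUp's continued dialogue at increasing coverage?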

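Key findings

CoverUp achieves higher median per-module coverage than CodaMosa (codex): line 81% vs. 62%, branch 53% vs. 35%, line+branch 78% vs. 55%.
Over all code, CoverUp raises line coverage from 54% to 61%, branch coverage from 34% to 43%, and line+branch coverage from 49% to 57%.
CoverUp also achieves higher coverage than CodaMosa (gpt4), both over the entire suite and per module.
On the PY benchmark suite, CoverUp approaches 100% in both overall coverage and median coverage on all metrics.
About 42.9% of successes were achieved through iterative prompting, rising to 49.2% when up to three prompts are considered, which highlights the value of continued dialogue.
CoverUp also performs well on code for which Pynguin initially achieved full coverage, indicating its robustness on difficult modules.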
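
The findings above report two kinds of aggregate: a per-module median and an overall figure across all code. The sketch below shows how these can differ, assuming line+branch coverage is the number of covered lines plus covered branches divided by the total number of lines plus branches. That definition, the function name `combined_coverage`, and the module counts are illustrative assumptions and are not data from the paper.

```python
from statistics import median


def combined_coverage(covered_lines, total_lines, covered_branches, total_branches):
    """Line+branch coverage in percent (assumed definition, see lead-in)."""
    total = total_lines + total_branches
    if total == 0:
        return 100.0  # nothing to cover
    return 100.0 * (covered_lines + covered_branches) / total


# Made-up counts: (covered lines, total lines, covered branches, total branches)
modules = {
    "a": (8, 10, 2, 4),
    "b": (5, 10, 1, 2),
    "c": (90, 100, 30, 60),
}

# Per-module median: every module counts equally, whatever its size.
per_module = [combined_coverage(*counts) for counts in modules.values()]
median_cov = median(per_module)

# Overall: sum the counts first, so large modules dominate the result.
overall = combined_coverage(*(sum(col) for col in zip(*modules.values())))
```

Because the overall figure weights modules by size while the median does not, the two can move in different directions, which is why summaries like the one above report both.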


This review was generated by AI and reviewed by human editors.