QUICK REVIEW

[Paper Digest] Software Testing with Large Language Models: Survey, Landscape, and Vision

Junjie Wang, Yuchao Huang|arXiv (Cornell University)|Jul 14, 2023

Software Testing and Debugging Techniques|Cited by 27

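One-sentence summary

A comprehensive survey that analyzes 102 studies on the use of large language models (LLMs) in software testing, detailing the tasks addressed, the LLMs used, the prompting strategies, the challenges, and future directions.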

ABSTRACT

Pre-trained large language models (LLMs) have recently emerged as a breakthrough technology in natural language processing and artificial intelligence, with the ability to handle large-scale datasets and exhibit remarkable performance across a wide range of tasks. Meanwhile, software testing is a crucial undertaking that serves as a cornerstone for ensuring the quality and reliability of software products. As the scope and complexity of software systems continue to grow, the need for more effective software testing techniques becomes increasingly urgent, making it an area ripe for innovative approaches such as the use of LLMs. This paper provides a comprehensive review of the utilization of LLMs in software testing. It analyzes 102 relevant studies that have used LLMs for software testing, from both the software testing and LLMs perspectives. The paper presents a detailed discussion of the software testing tasks for which LLMs are commonly used, among which test case preparation and program repair are the most representative. It also analyzes the commonly used LLMs, the types of prompt engineering that are employed, as well as the accompanied techniques with these LLMs. It also summarizes the key challenges and potential opportunities in this direction. This work can serve as a roadmap for future research in this area, highlighting potential avenues for exploration, and identifying gaps in our current understanding of the use of LLMs in software testing.

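Research motivation and goals

Map the landscape of LLM applications in software testing across the testing lifecycle.
Describe the software testing tasks most often handled by LLMs (e.g., test case preparation, debugging, and bug repair).
Analyze the LLM techniques, prompting strategies, and related methods used in these studies.
Identify key challenges and opportunities to guide future research and practice.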

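Proposed method

Automated and manual literature search across ACM/IEEE/arXiv/DBLP and top SE/AI venues, covering 2019–2023 and extended to October 2023.
Screening with inclusion/exclusion criteria to select papers that explicitly apply LLMs to software testing tasks.
Quality assessment using a scoring rubric (minimum score of eight) to ensure rigor.
Backward snowballing to broaden literature coverage.
Classification of studies from the software testing perspective (tasks, coverage) and the LLM perspective (models, prompts, techniques).
Synthesis of trends, limitations, and a roadmap for future work.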
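The screening steps above can be pictured as a simple filter. The following is a minimal sketch only, not the survey authors' tooling: the `Paper` fields, the example entries, and the `screen` helper are all hypothetical, and the only values taken from the digest are the 2019–2023 window and the minimum quality score of eight.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    applies_llm_to_testing: bool  # result of manual inclusion/exclusion screening
    quality_score: int            # total on a hypothetical quality rubric


# The digest states "at least eight points"; the rubric maximum is not given.
MIN_QUALITY_SCORE = 8


def screen(papers):
    """Keep papers in the search window that pass inclusion and quality checks."""
    return [
        p for p in papers
        if 2019 <= p.year <= 2023
        and p.applies_llm_to_testing
        and p.quality_score >= MIN_QUALITY_SCORE
    ]


# Made-up candidates, for illustration only.
candidates = [
    Paper("Hypothetical paper A", 2022, True, 9),
    Paper("Hypothetical paper B", 2021, False, 10),  # fails inclusion criteria
    Paper("Hypothetical paper C", 2023, True, 6),    # fails quality threshold
    Paper("Hypothetical paper D", 2018, True, 9),    # outside the search window
]
selected = screen(candidates)
print([p.title for p in selected])
```

In a real review the inclusion decision and the rubric score come from human assessors; the code only shows how the criteria combine.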

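Experimental results

Research questions

RQ1: Which software testing tasks do LLMs typically address (e.g., test case preparation, program repair, test oracle generation, system input generation)?
RQ2: Which LLMs, prompt types, input modalities, and related techniques are used across the studies?
RQ3: What are the current evaluation methods and reported performance trends in LLM-supported testing?
RQ4: What shortcomings, challenges, and opportunities remain in applying LLMs to software testing, and how can they be addressed?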

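Key findings

LLMs are most often applied to test case preparation, program debugging, and bug repair.
About one third of the studies pre-train or fine-tune an LLM; the rest rely on prompt engineering.
Zero-shot and few-shot prompting are the most common strategies; chain-of-thought and self-consistency are used less often.
Traditional testing techniques (e.g., differential testing, mutation testing) are often combined with LLMs to strengthen test case generation.
There are notable gaps in applying LLMs to early phases of the testing lifecycle and to non-functional testing, which point to future research directions.
The survey provides a roadmap that highlights publication trends, typical publication venues, and the gaps to close in order to accelerate adoption in practice.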
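To make the pairing of differential testing with LLM-generated inputs concrete, here is a minimal sketch under stated assumptions. It does not reproduce any method from the surveyed studies: `reference_abs`, `candidate_abs`, and `differential_test` are hypothetical names, and the hard-coded `generated_inputs` list stands in for inputs that an LLM would propose in practice.

```python
def reference_abs(x: int) -> int:
    """Trusted implementation used as the comparison baseline."""
    return abs(x)


def candidate_abs(x: int) -> int:
    """Implementation under test, with a deliberate bug for x == -1."""
    if x < -1:
        return -x
    return x


def differential_test(impl_a, impl_b, inputs):
    """Return (input, output_a, output_b) for every input where outputs differ."""
    disagreements = []
    for value in inputs:
        out_a, out_b = impl_a(value), impl_b(value)
        if out_a != out_b:
            disagreements.append((value, out_a, out_b))
    return disagreements


# Assumption: these inputs are a stand-in for LLM-generated test inputs.
generated_inputs = [5, 0, -1, -7]
mismatches = differential_test(reference_abs, candidate_abs, generated_inputs)
print(mismatches)  # [(-1, 1, -1)]
```

The appeal of this combination is that no hand-written oracle is needed: the model only has to supply varied inputs, and disagreement between the two implementations flags a likely defect.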

This digest was generated by AI and reviewed by human editors.