QUICK REVIEW

[Paper Review] TOGLL: Correct and Strong Test Oracle Generation with LLMs

Soneya Binta Hossain, Matthew B. Dwyer|arXiv (Cornell University)|May 6, 2024

Scientific Computing and Data Management | Cited by 7

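One-Sentence Summary

This paper studies fine-tuning code LLMs to generate correct, strong, and diverse test oracles, proposes TOGLL, and shows substantial improvements over TOGA and EvoSuite on unseen Java projects in accuracy, diversity, and mutation-based bug detection.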

ABSTRACT

Test oracles play a crucial role in software testing, enabling effective bug detection. Despite initial promise, neural-based methods for automated test oracle generation often result in a large number of false positives and weaker test oracles. While LLMs have demonstrated impressive effectiveness in various software engineering tasks, including code generation, test case creation, and bug fixing, there remains a notable absence of large-scale studies exploring their effectiveness in test oracle generation. The question of whether LLMs can address the challenges in effective oracle generation is both compelling and requires thorough investigation. In this research, we present the first comprehensive study to investigate the capabilities of LLMs in generating correct, diverse, and strong test oracles capable of effectively identifying a large number of unique bugs. To this end, we fine-tuned seven code LLMs using six distinct prompts on the SF110 dataset. Utilizing the most effective fine-tuned LLM and prompt pair, we introduce TOGLL, a novel LLM-based method for test oracle generation. To investigate the generalizability of TOGLL, we conduct studies on 25 large-scale Java projects. Besides assessing the correctness, we also assess the diversity and strength of the generated oracles. We compare the results against EvoSuite and the state-of-the-art neural method, TOGA. Our findings reveal that TOGLL can produce 3.8 times more correct assertion oracles and 4.9 times more exception oracles. Moreover, our findings demonstrate that TOGLL is capable of generating significantly diverse test oracles. It can detect 1,023 unique bugs that EvoSuite cannot, which is ten times more than what the previous SOTA neural-based method, TOGA, can detect.

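Motivation and Goals

Investigate whether fine-tuned code LLMs can generate correct and strong test oracles for software testing.
Evaluate how well LLM-generated oracles generalize to unseen, large-scale Java projects.
Evaluate the diversity and bug-detection strength of LLM-generated oracles relative to state-of-the-art baselines.
Release the dataset, models, and code to support replication and follow-up research on LLM-based test oracle generation.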

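Proposed Method

Fine-tune seven code LLMs (110M–2.7B parameters) on a dataset of test prefixes, methods under test (MUTs), and docstrings drawn from SF110, using six prompts that differ in the context they provide.
Select the best model-prompt pair by accuracy on the validation set to define TOGLL.
Assess correctness by executing test suites that integrate the generated oracles, measuring the success rate (non-empty oracles that pass).
Compare TOGLL against TOGA (the state-of-the-art neural method) and EvoSuite on 25 unseen large-scale Java projects.
Assess oracle strength via mutation testing with PIT, measuring mutant detection and the number of unique mutants killed.
Analyze the diversity of the generated assertions and their distribution across common assertion categories.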
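The idea of prompts that differ only in how much context they carry can be sketched as follows. This is a minimal illustration, not the paper's actual templates: the function `build_prompt`, the component ordering, the newline separator, and the Java snippets are all assumptions made for the example. Only the P5 (entire MUT) and P6 (docstring + MUT) variants named in this summary are shown.

```python
def build_prompt(test_prefix, mut_code=None, docstring=None):
    """Assemble a prompt from optional context plus the test prefix.

    Hypothetical helper: the ordering (docstring, MUT, prefix) and the
    newline separator are assumptions, not the paper's format.
    """
    parts = []
    if docstring:
        parts.append(docstring)
    if mut_code:
        parts.append(mut_code)
    parts.append(test_prefix)  # the prefix is always present
    return "\n".join(parts)


# Illustrative Java fragments (invented for this sketch).
prefix = 'Stack<String> s = new Stack<>();\ns.push("a");\nint n = s.size();'
mut = "public int size() { return elements.size(); }"
doc = "/** Returns the number of elements in this stack. */"

p5_like = build_prompt(prefix, mut_code=mut)                  # entire MUT
p6_like = build_prompt(prefix, mut_code=mut, docstring=doc)   # docstring + MUT
```

Keeping every context component optional makes it easy to fine-tune and compare one model under several prompt variants without changing the rest of the pipeline.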

Figure 1: Overview of our approach to explore LLM-based oracle generation and to evaluate TOGLL.

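Experimental Results

Research Questions

RQ1: Which LLM and prompt combination produces the most effective test oracles in terms of generation accuracy?
RQ2: How well does the fine-tuned TOGLL model generate correct test oracles on unseen projects compared with the baselines?
RQ3: How diverse are LLM-generated assertions compared with those generated by EvoSuite?
RQ4: How strong are the generated assertions at identifying unique bugs under mutation testing?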

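Key Findings

On unseen projects, TOGLL generates more correct oracles than TOGA: up to 3.8 times more correct assertion oracles and 4.9 times more exception oracles.
TOGLL's assertions are substantially more diverse than EvoSuite's and observe many distinct targets; only 18,630 of the 194,871 generated assertions are exact matches.
On unseen projects, TOGLL kills 1,023 unique mutants that EvoSuite does not, roughly ten times as many as TOGA, indicating strong bug-detection capability.
Prompt context matters: adding the method signature or the full method code improves accuracy. Across models, P5 (entire MUT) and P6 (docstring + MUT) generally perform best, while the docstring alone yields a smaller gain.
Among the evaluated models, CodeGen-350M and CodeParrot-110M are the best overall performers under the most effective prompts (P4–P6).
TOGLL remains strong across the 25 real-world projects, with an average success rate of 63% for correct assertion oracles and 93.4% for exception oracles, significantly outperforming TOGA.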
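The three kinds of numbers reported above (success rate, unique mutants killed, and exact-match share) can be computed along these lines. This is a rough sketch under stated assumptions: the `OracleResult` record, the mutant identifiers, and the choice of "all generated oracles" as the success-rate denominator are hypothetical, since this summary does not spell out the exact definitions. Only the counts 18,630 and 194,871 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleResult:
    """Outcome of running one test with a generated oracle (hypothetical record)."""
    oracle: str   # generated oracle text; empty means nothing was generated
    passed: bool  # True if the test passed on the unmodified program


def success_rate(results):
    """Share of results whose oracle is non-empty and passes.

    Assumption: the denominator is every attempted oracle.
    """
    if not results:
        return 0.0
    ok = sum(1 for r in results if r.oracle.strip() and r.passed)
    return ok / len(results)


def unique_kills(killed_by_tool, killed_by_baseline):
    """Mutants killed by the tool that the baseline does not kill."""
    return set(killed_by_tool) - set(killed_by_baseline)


# Toy data, invented for illustration.
results = [
    OracleResult("assertEquals(2, stack.size());", True),
    OracleResult("assertTrue(stack.isEmpty());", False),  # fails when run
    OracleResult("", True),                               # empty oracle
    OracleResult("assertNotNull(stack.peek());", True),
]
rate = success_rate(results)
extra = unique_kills({"m1", "m2", "m3"}, {"m2", "m4"})

# Exact-match share from the counts reported in this summary.
exact_match_pct = round(18_630 / 194_871 * 100, 2)
```

With the reported counts, the exact-match share works out to roughly 9.56%, which is what the diversity finding rests on: the large majority of generated assertions are not exact matches.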

Figure 2: EvoSuite-Generated Test Cases with Assertion and Exception Oracles. The prefix part is marked with yellow color.
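Since the figure itself is not reproduced here, the distinction between a test prefix, an assertion oracle, and an exception oracle can be illustrated with a small Python analogue. The paper works with Java and EvoSuite-generated JUnit tests; the `BoundedStack` class and both test functions below are invented purely for illustration.

```python
class BoundedStack:
    """Hypothetical class under test, invented for this illustration."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, item):
        if len(self.items) >= self.capacity:
            raise OverflowError("stack is full")
        self.items.append(item)

    def size(self):
        return len(self.items)


def test_assertion_oracle():
    # Test prefix: build the object and drive it into a state of interest.
    stack = BoundedStack(capacity=2)
    stack.push("a")
    # Assertion oracle: checks the observed value against the expected one.
    assert stack.size() == 1


def test_exception_oracle():
    # Test prefix: fill the stack to capacity.
    stack = BoundedStack(capacity=1)
    stack.push("a")
    # Exception oracle: the next call is expected to raise.
    try:
        stack.push("b")
    except OverflowError:
        return
    raise AssertionError("expected OverflowError")


test_assertion_oracle()
test_exception_oracle()
```

In both tests the prefix is the same kind of setup code; what an oracle generator has to supply is the final check, either an assertion on a value or an expectation that an exception is thrown.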

This review was generated by AI and reviewed by human editors.