QUICK REVIEW

[Paper Review] Fault Detection Effectiveness of Metamorphic Relations Developed for Testing Supervised Classifiers

Prashanta Saha, Upulee Kanewala|arXiv (Cornell University)|Jan 1, 2019

Software Testing and Debugging Techniques | References: 22 | Cited by: 1

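One-Sentence Summary

This study evaluates the fault detection effectiveness of metamorphic relations (MRs) used to test supervised classifiers, focusing on the k-nearest neighbors (k-NN) algorithm and using 709 reachable mutants. Only 14.8% of the mutants were detected, which indicates that MRs based on user expectations are far less effective than previous studies reported.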

ABSTRACT

In machine learning, supervised classifiers are used to obtain predictions for unlabeled data by inferring prediction functions using labeled data. Supervised classifiers are widely applied in domains such as computational biology, computational physics and healthcare to make critical decisions. However, it is often hard to test supervised classifiers since the expected answers are unknown. This is commonly known as the oracle problem and metamorphic testing (MT) has been used to test such programs. In MT, metamorphic relations (MRs) are developed from intrinsic characteristics of the software under test (SUT). These MRs are used to generate test data and to verify the correctness of the test results without the presence of a test oracle. Effectiveness of MT heavily depends on the MRs used for testing. In this paper we have conducted an extensive empirical study to evaluate the fault detection effectiveness of MRs that have been used in multiple previous studies to test supervised classifiers. Our study uses a total of 709 reachable mutants generated by multiple mutation engines and uses data sets with varying characteristics to test the SUT. Our results reveal that only 14.8% of these mutants are detected using the MRs and that the fault detection effectiveness of these MRs do not scale with the increased number of mutants when compared to what was reported in previous studies.
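
To make the idea of a source test case, a follow-up test case, and a relation between their outputs concrete, here is a minimal sketch in Python. It is an illustration only: the tiny k-NN function, the data, and the relation (adding a query back into the training data with its own predicted label should not change that prediction) are assumptions chosen for clarity, and are not taken from the paper's Weka-based setup or its numbered MRs.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    # Rank by (squared Euclidean distance, label) so ranking is deterministic.
    ranked = sorted(
        train,
        key=lambda item: (sum((a - b) ** 2 for a, b in zip(item[0], query)), item[1]),
    )
    votes = Counter(label for _, label in ranked[:k])
    top = max(votes.values())
    # Break vote ties by the smallest label so the output is deterministic.
    return min(label for label, count in votes.items() if count == top)


def holds_re_prediction(classifier, train, query):
    """Illustrative MR: adding the query with its own predicted label to the
    training data should leave the prediction for that query unchanged."""
    source_output = classifier(train, query)             # source test case
    follow_up_train = train + [(query, source_output)]   # follow-up test case
    follow_up_output = classifier(follow_up_train, query)
    # Only the two outputs are compared; no expected "correct" label is needed.
    return source_output == follow_up_output


# Hypothetical toy data: two well-separated classes.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
queries = [(1.0, 0.9), (5.1, 5.0), (3.0, 3.0)]

results = [holds_re_prediction(knn_predict, train, q) for q in queries]
print(results)
```

The check never needs to know which label is correct for a query, which is how metamorphic testing sidesteps the oracle problem. A violated relation signals a fault, while a satisfied relation proves nothing about correctness.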

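Research Motivation and Objectives

Empirically evaluate the fault detection effectiveness of the metamorphic relations (MRs) that previous studies used to test supervised classifiers.
Address a limitation of previous studies, which evaluated MRs on very few mutants (for example, 22 to 24).
Investigate whether MRs based on user expectations can reliably detect faults in real-world supervised classifier implementations.
Examine how varying the size of the test data set affects the fault detection effectiveness of MRs.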

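Proposed Method

Two mutation tools, MuJava and Major, were used to generate 709 reachable mutants of the real k-NN implementation in the Weka library.
Ten predefined MRs, based on user expectations and on properties of the k-NN algorithm, were applied to generate follow-up test cases.
Source test cases with different data set sizes were used to assess the robustness of the MRs under different input conditions.
Fault detection effectiveness was measured as the percentage of mutants killed (that is, detected) by each MR, determined by comparing the expected and actual changes in output.
Mutant kill rates for MuJava and Major were compared to assess the consistency and dominance of specific MRs.
The correlation between MRs and mutant kill rates was analyzed to identify the most effective relations.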
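
The evaluation loop described above (run each MR against each mutant, count a mutant as killed when at least one MR is violated) can be sketched as follows. This is a hedged illustration, not the study's harness: the three mutants are hypothetical hand-written faults standing in for tool-generated ones, the two relations are illustrative and not necessarily among the ten MRs evaluated, and all names are assumptions.

```python
from collections import Counter


def vote(labels):
    """Majority label; ties go to the smallest label for determinism."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn(train, query, k=3):
    """Reference k-NN: rank by (distance, label), vote among the k nearest."""
    ranked = sorted(train, key=lambda it: (sq_dist(it[0], query), it[1]))
    return vote(label for _, label in ranked[:k])


# --- Hypothetical hand-written mutants (stand-ins for tool-generated ones) ---
def mutant_no_ranking(train, query, k=3):
    return vote(label for _, label in train[:k])  # fault: never ranks by distance


def mutant_k_plus_one(train, query, k=3):
    ranked = sorted(train, key=lambda it: (sq_dist(it[0], query), it[1]))
    return vote(label for _, label in ranked[:k + 1])  # fault: off-by-one in k


def mutant_first_attribute_only(train, query, k=3):
    ranked = sorted(train, key=lambda it: ((it[0][0] - query[0]) ** 2, it[1]))
    return vote(label for _, label in ranked[:k])  # fault: ignores other attributes


# --- Illustrative MRs: each returns True when the relation holds ---
def mr_reverse_training_order(clf, train, queries):
    # Reordering training rows should not change any prediction.
    return [clf(train, q) for q in queries] == [clf(train[::-1], q) for q in queries]


def mr_reverse_attribute_order(clf, train, queries):
    # Reordering attributes consistently in training data and queries
    # should not change any prediction.
    flipped_train = [(x[::-1], y) for x, y in train]
    source = [clf(train, q) for q in queries]
    follow_up = [clf(flipped_train, q[::-1]) for q in queries]
    return source == follow_up


# Hypothetical toy data.
train = [((1.0, 5.0), "a"), ((1.5, 4.0), "a"), ((2.0, 5.5), "a"),
         ((5.0, 1.0), "b"), ((5.5, 2.0), "b"), ((4.5, 1.5), "b")]
queries = [(1.2, 4.8), (5.1, 1.4), (1.2, 1.4)]

relations = {"row_order": mr_reverse_training_order,
             "attribute_order": mr_reverse_attribute_order}
mutants = {"no_ranking": mutant_no_ranking,
           "k_plus_one": mutant_k_plus_one,
           "first_attribute_only": mutant_first_attribute_only}

# Sanity check: the unmutated program must satisfy every MR on this data.
assert all(mr(knn, train, queries) for mr in relations.values())

# A mutant is killed by an MR when the relation is violated on the mutant.
killed_by = {
    name: [mr_name for mr_name, mr in relations.items() if not mr(m, train, queries)]
    for name, m in mutants.items()
}
kill_rate = 100 * sum(bool(v) for v in killed_by.values()) / len(mutants)

print(killed_by)
print(f"{kill_rate:.1f}% of mutants killed")
```

On this toy data the off-by-one mutant satisfies both relations and survives, which mirrors the study's central concern: a faulty program can still behave consistently under the transformations an MR applies.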

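Experimental Results

Research Questions

RQ1: How effective are MRs developed from user expectations at detecting faults in supervised classifiers?
RQ2: Compared with previous studies that used small mutant sets, does increasing the number of mutants used in the evaluation significantly change fault detection effectiveness?
RQ3: Does the size of the input data set used as the source test case affect the fault detection effectiveness of MRs?
RQ4: Which MRs are most effective at detecting mutants, and are they consistent across mutation tools?
RQ5: How do MRs based on user expectations compare with MRs based on algorithm properties in fault detection effectiveness?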

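Key Findings

Of the 709 reachable mutants, only 14.8% were detected by the 10 MRs, so actual fault detection effectiveness is low despite earlier claims of high effectiveness.
The fault detection effectiveness of the MRs did not improve as the number of mutants increased, which contradicts the higher detection rates that earlier studies reported for small mutant sets.
MR7 and MR9 achieved the highest mutant kill rates on both MuJava and Major, which makes them the most effective of the tested relations.
Mutants generated by MuJava had a higher overall kill rate (43.6%) than those generated by Major (35.1%), but for most individual MRs the mutants generated by Major were easier to kill, which suggests that MR7 dominates detection.
Changing the size of the randomly generated data sets used as source test cases had no significant effect on the fault detection effectiveness of the MRs.
The results indicate that MRs based on user expectations are insufficient for reliable fault detection, and that more effective MRs based on algorithm properties need to be developed.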
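
For a sense of scale, the headline percentage can be converted back into an approximate mutant count. The short calculation below is only an estimate: the summary reports the rate (14.8%) and the total (709), not the absolute number of killed mutants, so the count is inferred from a rounded percentage.

```python
TOTAL_MUTANTS = 709        # reachable mutants, as stated in the summary
REPORTED_KILL_RATE = 14.8  # percent, as stated in the summary

# Estimated counts (inferred, not reported in the summary).
killed_estimate = round(TOTAL_MUTANTS * REPORTED_KILL_RATE / 100)
surviving_estimate = TOTAL_MUTANTS - killed_estimate

# Check that the estimate rounds back to the reported percentage.
recomputed_rate = 100 * killed_estimate / TOTAL_MUTANTS

print(killed_estimate, surviving_estimate, f"{recomputed_rate:.1f}%")
```

Under this estimate, roughly 105 mutants were detected and roughly 604 went undetected by all ten MRs.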

This review was generated by AI and reviewed by a human editor.