QUICK REVIEW

[Paper Review] DeepMutation: Mutation Testing of Deep Learning Systems

Lei Ma, Fuyuan Zhang | arXiv (Cornell University) | May 14, 2018

Software Testing and Debugging Techniques | Cited by 29

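One-Sentence Summary

The paper proposes DeepMutation, a mutation testing framework for deep learning systems that evaluates test data quality by injecting faults into the training data, the training programs, and the trained models through dedicated mutation operators. Its results indicate that a higher mutant detection rate is associated with better test suite quality, which suggests the approach could help improve the robustness and reliability of deep learning systems; the evaluation uses MNIST and CIFAR-10 with three DL models.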

ABSTRACT

Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance to gain confidence of the trained models. Using an inadequate test dataset, DL models that have achieved high test accuracy may still lack generality and robustness. In traditional software testing, mutation testing is a well-established technique for quality evaluation of test suites, which analyzes to what extent a test suite detects the injected faults. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems to measure the quality of test data. To do this, by sharing the same spirit of mutation testing in traditional software, we first define a set of source-level mutation operators to inject faults to the source of DL (i.e., training data and training programs). Then we design a set of model-level mutation operators that directly inject faults into DL models without a training process. Eventually, the quality of test data could be evaluated from the analysis on to what extent the injected faults could be detected. The usefulness of the proposed mutation testing techniques is demonstrated on two public datasets, namely MNIST and CIFAR-10, with three DL models.

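Motivation and Objectives

Address the lack of systematic techniques for evaluating test data quality in deep learning systems.
Identify key weaknesses in test suites that can leave model vulnerabilities undetected despite high accuracy.
Adapt mutation testing, a well-established technique in traditional software, to the distinct characteristics of deep learning systems.
Design and validate mutation operators that simulate real faults in training data, training code, and trained models.
Provide a scalable, efficient framework for measuring how effectively a test suite detects model-level faults.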

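Proposed Method

Proposes eight source-level mutation operators that target the training data and training programs, simulating common defects in data collection and implementation.
Designs eight model-level mutation operators that directly modify a model's weights and structure without retraining, enabling fast mutant generation.
Applies standard mutation testing metrics (such as mutant kill rate and mutation score) to evaluate test suite effectiveness.
Runs each mutant model on the test set and compares its outputs with those of the original model to detect behavioral differences.
Treats the detection of inconsistent outputs as evidence of the test suite's effectiveness at finding faults.
Takes a hybrid approach that combines source-level and model-level mutation to cover multiple fault types and improve test coverage.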
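
The model-level steps above (mutate a trained model without retraining, run the test set, compare outputs with the original) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy linear classifier, the function names (`gaussian_fuzz`, `is_killed`, `kill_rate`), the noise level, and the rule that a single differing prediction kills a mutant are all assumptions made for the sketch.

```python
import random


def predict(weights, x):
    """Return the index of the class with the highest linear score for input x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return scores.index(max(scores))


def gaussian_fuzz(weights, sigma, rng):
    """Model-level mutation (assumed form): copy the weights and add Gaussian noise."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def is_killed(original, mutant, test_inputs):
    """A mutant counts as killed if any test input is predicted differently
    by the mutant than by the original model."""
    return any(predict(original, x) != predict(mutant, x) for x in test_inputs)


def kill_rate(original, mutants, test_inputs):
    """Fraction of mutants that the test inputs distinguish from the original."""
    killed = sum(is_killed(original, m, test_inputs) for m in mutants)
    return killed / len(mutants)


rng = random.Random(0)                 # fixed seed so the run is repeatable
original = [[1.0, 0.0], [0.0, 1.0]]    # toy 2-class linear "model" (hypothetical)
mutants = [gaussian_fuzz(original, sigma=0.5, rng=rng) for _ in range(20)]

# Inputs far from the decision boundary rarely change class under small noise.
easy_tests = [[5.0, 0.0], [0.0, 5.0]]
# Adding near-boundary inputs gives the test set a chance to expose the mutants.
hard_tests = easy_tests + [[1.0, 0.9], [0.9, 1.0]]

print("kill rate, easy tests:", kill_rate(original, mutants, easy_tests))
print("kill rate, hard tests:", kill_rate(original, mutants, hard_tests))
```

Because the second test set contains the first, its kill rate can never be lower. Both sets can score perfect accuracy on the original model, so the kill rate is what separates them.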

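Experimental Results

Research Questions

RQ1: To what extent can mutation testing detect faults injected into the training data and training programs of deep learning systems?
RQ2: How effective are model-level mutation operators at revealing test suite weaknesses without retraining?
RQ3: Can the mutant detection rate serve as a reliable indicator of test data quality in deep learning?
RQ4: How do different mutation operators compare in their ability to expose model vulnerabilities?
RQ5: Can the proposed framework identify test suite deficiencies that high-accuracy benchmarks may miss?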

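Key Findings

The proposed source-level mutation operators successfully injected realistic faults into the training data and programs, simulating common data and coding errors.
The model-level mutation operators enabled efficient generation of large mutant sets and revealed fine-grained, model-level issues that source-level mutation can miss.
On MNIST and CIFAR-10, test suites with higher mutant kill rates were associated with better robustness and generalization, even when accuracy was already high.
The mutation score correlated strongly with test suite quality, suggesting its potential as a quantitative metric for evaluating test data.
The framework showed that a model can remain fragile under small perturbations even at high accuracy, which highlights the limitations of the test suite.
The evaluation confirms that mutation testing is a feasible and effective way to assess test data quality in deep learning systems.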
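
To make the mutation score concrete, here is one class-level way to compute it. The formula is an assumption of this sketch: it reflects a reading of the DeepMutation paper's definition, not anything stated in this review. Under it, a test input kills a class of a mutant when the original model classifies the input correctly and the mutant does not, and the score is the number of killed (mutant, class) pairs divided by the number of mutants times the number of classes. The function names and the toy predictions are hypothetical.

```python
def killed_classes(labels, original_preds, mutant_preds):
    """Classes for which some test input is classified correctly by the
    original model but incorrectly by the mutant."""
    return {
        y
        for y, p_orig, p_mut in zip(labels, original_preds, mutant_preds)
        if p_orig == y and p_mut != y
    }


def mutation_score(labels, original_preds, mutants_preds, num_classes):
    """Killed (mutant, class) pairs divided by (#mutants * #classes).

    num_classes is assumed to be the number of classes the test data covers.
    """
    total = sum(
        len(killed_classes(labels, original_preds, preds))
        for preds in mutants_preds
    )
    return total / (len(mutants_preds) * num_classes)


# Toy data (made up for illustration): six test inputs over three classes.
labels         = [0, 0, 1, 1, 2, 2]
original_preds = [0, 0, 1, 1, 2, 1]   # the original misclassifies the last input
mutants_preds = [
    [0, 1, 1, 1, 2, 1],   # kills class 0 only
    [0, 0, 1, 1, 2, 2],   # kills nothing: it only differs where the original was wrong
    [1, 0, 2, 1, 0, 1],   # kills classes 0, 1 and 2
]

score = mutation_score(labels, original_preds, mutants_preds, num_classes=3)
print(score)   # 4 killed (mutant, class) pairs out of 3 * 3 = 9
```

Counting killed classes instead of whole mutants rewards test data that exposes a mutant across many classes, not through one easy input. Inputs the original model already gets wrong are ignored, so they cannot inflate the score.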


This review was generated by AI and reviewed by human editors.