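[Paper Summary] Combinatorial Testing for Deep Learning Systems
This paper explores applying combinatorial testing (CT) to deep learning (DL) systems. It proposes DL-specific CT coverage criteria and a CT-guided test generation method for evaluating local robustness and adversarial vulnerability.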
Deep learning (DL) has achieved remarkable progress over the past decade and has been widely applied to many safety-critical applications. However, the robustness of DL systems has recently raised great concern, for example adversarial examples against computer vision systems, which could potentially result in severe consequences. Adopting testing techniques could help to evaluate the robustness of a DL system and therefore detect vulnerabilities at an early stage. The main challenge of testing such systems is that their runtime state space is too large: if we view each neuron as a runtime state for DL, then a DL system often contains massive states, rendering testing each state almost impossible. For traditional software, combinatorial testing (CT) is an effective testing technique to reduce the testing space while obtaining relatively high defect detection abilities. In this paper, we perform an exploratory study of CT on DL systems. We adapt the concept in CT and propose a set of coverage criteria for DL systems, as well as a CT coverage guided test generation technique. Our evaluation demonstrates that CT provides a promising avenue for testing DL systems. We further pose several open questions and interesting directions for combinatorial testing of DL systems.
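Research Motivation and Goals
- Motivate the testing of DL systems by robustness concerns in safety-critical applications, such as adversarial examples.
- Adapt CT criteria to DL by defining neuron-activation configurations that split each neuron's output at 0.
- Propose a CT-guided test generation technique that systematically covers CT targets in the layers of a DL model.
- Demonstrate the usefulness of CT for robustness testing through an empirical evaluation on MNIST models.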
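Proposed Method
- Define neuron-activation configurations by splitting each neuron's output at 0 (activated versus not activated).
- Introduce 2-way combination sparse coverage and 2-way combination dense coverage over sets of neurons within the same layer.
- Generalize CT to (p, t)-completeness in order to quantify CT coverage across a layer.
- Develop the CT Coverage Guided TestGen algorithm, which uses constraint-based test generation (LP-based in this study) to cover CT targets iteratively, layer by layer.
- Implement the DeepCT framework with Keras/TensorFlow, using linear programming (CPLEX) for test generation.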
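To make the coverage criteria listed above more concrete, here is a minimal Python sketch of how they could be computed from one layer's neuron outputs. It rests on our reading of the criteria, which is an assumption: a neuron counts as active when its output is greater than 0; for every t-neuron combination we record which of the 2^t activation configurations the test set hits; sparse coverage is the share of combinations with all configurations hit; dense coverage is the average share of configurations hit; and (p, t)-completeness is the share of combinations with at least a fraction p of configurations hit. The function name `t_way_coverage` and the toy data are hypothetical and are not taken from DeepCT.

```python
from itertools import combinations

import numpy as np


def t_way_coverage(layer_outputs, t=2, p=0.5):
    """Estimate CT coverage for one layer (illustrative sketch, not DeepCT code).

    layer_outputs: array-like of shape (num_tests, num_neurons) holding the
    output of every neuron in the layer for every test input.
    """
    # Split each neuron's output at 0: True means "activated".
    active = np.asarray(layer_outputs) > 0
    num_neurons = active.shape[1]

    ratios = []
    for combo in combinations(range(num_neurons), t):
        # Distinct activation configurations observed for this neuron combination.
        seen = set(map(tuple, active[:, list(combo)].tolist()))
        ratios.append(len(seen) / 2 ** t)
    ratios = np.array(ratios)

    return {
        # Share of combinations whose 2^t configurations are all covered.
        "sparse": float(np.mean(ratios == 1.0)),
        # Average share of covered configurations over all combinations.
        "dense": float(np.mean(ratios)),
        # Share of combinations with at least a fraction p of configurations covered.
        "completeness": float(np.mean(ratios >= p)),
    }


# Hypothetical toy data: 3 test inputs, 3 neurons in the layer.
toy_outputs = [
    [1.0, -1.0, 0.5],
    [-1.0, 1.0, 0.5],
    [1.0, 1.0, -0.5],
]
print(t_way_coverage(toy_outputs, t=2, p=0.75))
```

With these three toy inputs, every neuron pair sees three of its four configurations, so sparse coverage is 0 while dense coverage is 0.75. Adding a test in which all three neurons are inactive would complete every pair. In practice the layer outputs would come from the model under test, and the number of combinations grows quickly with layer width, so a real implementation would likely need a more compact encoding than Python sets.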
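Experimental Results
Research Questions
- RQ1: Can CT concepts be adapted to DL so that the testing space is reduced without sacrificing the ability to detect robustness issues?
- RQ2: Can DL-specific CT coverage criteria effectively guide test generation and reveal local robustness issues and adversarial examples?
- RQ3: How does CT-based testing differ from random testing in coverage and in detecting adversarial examples for DL models?
Key Findings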
| Model | Testing Method | 2-Way Sparse Coverage (%) | 2-Way Dense Coverage (%) | (0.5, 2)-Completeness (%) | (0.75, 2)-Completeness (%) | Tests | Adversarial Test Ratio (%) |
|---|---|---|---|---|---|---|---|
| DNN 1 | Random | 2.28 | 34.95 | 33.75 | 3.75 | 10,000 | 0.00 |
| DNN 1 | CT L1 | 60.27 | 81.56 | 95.01 | 70.98 | 4,073 | 0.29 |
| DNN 1 | CT L2 | 76.94 | 91.98 | 99.67 | 91.30 | 6,768 | 2.17 |
| DNN 1 | CT L3 | 93.62 | 98.23 | 100.00 | 99.32 | 8,032 | 9.91 |
| DNN 2 | Random | 1.18 | 32.56 | 26.98 | 2.10 | 10,000 | 0.00 |
| DNN 2 | CT L1 | 46.96 | 75.10 | 91.95 | 61.50 | 8,547 | 1.87 |
| DNN 2 | CT L2 | 68.91 | 87.52 | 98.64 | 82.55 | 11,573 | 3.53 |
| DNN 2 | CT L3 | 97.15 | 99.05 | 100.00 | 99.03 | 13,129 | 8.84 |
| DNN 2 | CT L4 | 97.41 | 99.11 | 100.00 | 99.03 | 13,217 | 9.35 |
| DNN 2 | CT L5 | 97.81 | 99.21 | 100.00 | 99.03 | 13,351 | 9.98 |
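- On the analyzed layers, CT-guided test generation achieved far higher 2-way coverage than random testing.
- On the MNIST DNNs, CT-based testing reached its highest 2-way sparse coverage (97.81%) and 2-way dense coverage (99.21%) on DNN 2 at L5, using roughly 4k–13k generated tests. DNN 1 needed fewer tests than the 10,000 random tests (4,073–8,032), whereas DNN 2 exceeded 10,000 tests from L2 onward (11,573–13,351).
- CT-based tests were more effective than random tests at finding adversarial examples: random testing found none (0.00%), while the adversarial test ratio under CT rose to 9.91% (DNN 1, L3) and 9.98% (DNN 2, L5), with most of the increase occurring across L1–L3.
- Random testing achieved limited 2-way coverage (for example, 2.28% sparse coverage on DNN 1) and weak completeness even with 10,000 tests, whereas DeepCT achieved much higher coverage with a comparable number of tests (fewer for DNN 1, up to about 13k for DNN 2).
- The CT-guided results indicate that different layers contribute differently to robustness detection, which suggests setting focused CT targets for each layer.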
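The adversarial test ratios in the table can be turned into rough absolute counts, which makes the layer-by-layer trend easier to compare. The short Python sketch below does this arithmetic under one assumption: that the ratio is the number of adversarial tests divided by the number of generated tests. Because the published ratios are rounded to two decimals, the resulting counts are approximations, not figures reported by the paper. The helper name `approx_adversarial_count` is hypothetical.

```python
# (tests generated, adversarial test ratio in %) copied from the results table.
ct_rows = {
    ("DNN 1", "CT L1"): (4073, 0.29),
    ("DNN 1", "CT L2"): (6768, 2.17),
    ("DNN 1", "CT L3"): (8032, 9.91),
    ("DNN 2", "CT L1"): (8547, 1.87),
    ("DNN 2", "CT L2"): (11573, 3.53),
    ("DNN 2", "CT L3"): (13129, 8.84),
    ("DNN 2", "CT L4"): (13217, 9.35),
    ("DNN 2", "CT L5"): (13351, 9.98),
}


def approx_adversarial_count(tests, ratio_percent):
    """Approximate count, assuming ratio = adversarial tests / generated tests."""
    return round(tests * ratio_percent / 100)


for (model, method), (tests, ratio) in ct_rows.items():
    count = approx_adversarial_count(tests, ratio)
    print(f"{model} {method}: about {count} adversarial tests out of {tests}")
```

Under that assumption, DNN 1 goes from roughly a dozen adversarial tests at L1 to several hundred at L3, which is consistent with the observation that most of the gain appears across the first three layers.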
This summary was generated by AI and reviewed by a human editor.