QUICK REVIEW

[Paper Review] The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models

Chakkrit Tantithamthavorn, Ahmed E. Hassan|arXiv (Cornell University)|Jan 31, 2018

Software Engineering Research | Cited by 37

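One-Sentence Summary

This study investigates how four class rebalancing techniques (over-sampling, under-sampling, SMOTE, and ROSE) affect defect prediction models across 101 software datasets. It finds that rebalancing significantly improves Recall but harms model interpretability, while AUC remains unchanged. The authors therefore recommend AUC as the standard performance measure and advise caution in applying rebalancing techniques when extracting actionable insights from models.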

ABSTRACT

Defect prediction models that are trained on class imbalanced datasets (i.e., the proportion of defective and clean modules is not equally represented) are highly susceptible to produce inaccurate prediction models. Prior research compares the impact of class rebalancing techniques on the performance of defect prediction models. Prior research efforts arrive at contradictory conclusions due to the use of different choice of datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect prediction models. In this paper, we investigate the impact of 4 popularly-used class rebalancing techniques on 10 commonly-used performance measures and the interpretation of defect prediction models. We also construct statistical models to better understand in which experimental design settings that class rebalancing techniques are beneficial for defect prediction models. Through a case study of 101 datasets that span across proprietary and open-source systems, we recommend that class rebalancing techniques are necessary when quality assurance teams wish to increase the completeness of identifying software defects (i.e., Recall). However, class rebalancing techniques should be avoided when interpreting defect prediction models. We also find that class rebalancing techniques do not impact the AUC measure. Hence, AUC should be used as a standard measure when comparing defect prediction models.

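Motivation and Objectives

Resolve the contradictory findings of prior research on class rebalancing in defect prediction through a large-scale empirical study.
Examine how rebalancing affects the performance and interpretability of defect prediction models across datasets and classifiers.
Identify the experimental conditions under which rebalancing brings measurable benefits or drawbacks.
Give practitioners and researchers actionable guidance on when and how to apply rebalancing techniques.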

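Proposed Method

The study evaluates four rebalancing techniques (over-sampling, under-sampling, SMOTE, and ROSE) on 101 defect prediction datasets drawn from open-source and proprietary systems.
Defect prediction models are trained with seven classification algorithms: random forest, logistic regression, naive Bayes, AVNNet, C5.0, xGBTree, and GBM.
Performance is measured with 10 measures: three threshold-independent (e.g., AUC) and seven threshold-dependent (e.g., Precision, Recall, F1).
Statistical models are constructed to analyze the relationship between experimental settings (e.g., degree of dataset imbalance, dimensionality) and model performance and interpretability.
The SMOTE parameter k is tested at k=5 and k=14 to assess sensitivity; no significant difference is found.
Interpretability is assessed by comparing the top-ranked features of the baseline models with those of the rebalanced models, in order to detect concept drift.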
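The resampling ideas behind three of the four techniques can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy two-feature data, and the choice to balance the classes to an exact 1:1 ratio are all assumptions made for this example. `smote_like` only mimics the core SMOTE idea (interpolating between a minority sample and one of its k nearest minority neighbours); ROSE, which draws synthetic samples from a smoothed bootstrap, is omitted.

```python
import random
from collections import Counter


def split_by_class(samples):
    """Split (features, label) pairs with binary labels into (minority, majority)."""
    counts = Counter(label for _, label in samples)
    minority_label = min(counts, key=counts.get)
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    return minority, majority


def oversample(samples, rng):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    minority, majority = split_by_class(samples)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


def undersample(samples, rng):
    """Keep all minority rows and a random, equally sized subset of majority rows."""
    minority, majority = split_by_class(samples)
    return minority + rng.sample(majority, len(minority))


def smote_like(samples, rng, k=5):
    """Add synthetic minority rows on the segment between a minority row
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    minority, majority = split_by_class(samples)
    label = minority[0][1]
    synthetic = []
    for _ in range(len(majority) - len(minority)):
        x, _ = rng.choice(minority)
        neighbours = sorted(
            (m[0] for m in minority if m[0] is not x),
            key=lambda other: sum((a - b) ** 2 for a, b in zip(x, other)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to its neighbour
        synthetic.append((tuple(a + gap * (b - a) for a, b in zip(x, other)), label))
    return majority + minority + synthetic


# Hypothetical toy data: 18 clean modules (label 0) and 4 defective ones (label 1).
rng = random.Random(42)
clean = [((rng.uniform(0, 1), rng.uniform(0, 1)), 0) for _ in range(18)]
defective = [((rng.uniform(2, 3), rng.uniform(2, 3)), 1) for _ in range(4)]
data = clean + defective

over = oversample(data, rng)
under = undersample(data, rng)
synth = smote_like(data, rng, k=3)

for name, resampled in [("original", data), ("over", over), ("under", under), ("smote-like", synth)]:
    print(name, dict(Counter(label for _, label in resampled)))
```

Note the trade-off visible even in this toy setting: under-sampling discards most of the clean modules, while over-sampling and the SMOTE-like variant enlarge the training set with repeated or interpolated defective modules.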

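Experimental Results

Research Questions

RQ1: How do different class rebalancing techniques affect the performance of defect prediction models across datasets and classifiers?
RQ2: To what extent does class rebalancing affect the interpretability of defect prediction models, in particular changes in feature importance?
RQ3: Which performance measures are sensitive to rebalancing and which remain unchanged, AUC in particular?
RQ4: Under which experimental conditions (e.g., degree of dataset imbalance, dimensionality) is rebalancing most beneficial for improving Recall?
RQ5: How does the choice of classification algorithm moderate the impact of rebalancing on performance and interpretability?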

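Key Findings

Only 8% of the defect prediction datasets have a defective ratio between 45% and 55%, indicating that class imbalance is widespread in real-world software systems.
The AUC measure is not significantly affected by any of the four rebalancing techniques, which contradicts findings in the general machine learning literature.
Class rebalancing techniques yield the largest improvement in Recall but the largest decline in Precision, indicating a trade-off between the completeness and the correctness of defect identification.
The largest performance gains occur when under-sampling is applied with logistic regression on highly imbalanced, low-dimensional datasets.
Interpretability degrades substantially: for neural network models, only 23%–34% of the top-ranked features of the rebalanced models overlap with those of the baseline models, compared with 55%–62% for logistic regression models.
For random forest models, 68%–71% of the top-ranked features of the rebalanced models do not appear among the top-ranked features of the baseline models, indicating substantial concept drift.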
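Two of these findings are easier to see with a small worked example. The sketch below is an illustration under stated assumptions, not the paper's analysis: the scores, labels, feature names, the 0.5 threshold, and the helper names are hypothetical. It models the effect of rebalancing crudely as a uniform upward shift in predicted defect probabilities. Such a shift preserves the ranking of modules, so AUC (which depends only on ranking) cannot change, while Recall and Precision at a fixed threshold do. A retrained model may also reorder modules, so this shows only one mechanism consistent with the reported results. `top_k_overlap` is one simple way to quantify how many top-ranked features two models share; the paper's exact comparison procedure may differ.

```python
def auc(scores, labels):
    """Probability that a random defective module outscores a random clean one
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(scores, labels, threshold=0.5):
    """Precision and Recall when scores at or above the threshold are flagged as defective."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def top_k_overlap(baseline_rank, rebalanced_rank, k):
    """Fraction of the baseline's top-k features that also appear in the rebalanced top-k."""
    return len(set(baseline_rank[:k]) & set(rebalanced_rank[:k])) / k


# Hypothetical data: 3 defective modules (label 1) and 7 clean modules (label 0).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
baseline_scores = [0.60, 0.40, 0.30, 0.45, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05]
# Assumed stand-in for a rebalanced model: every score moves up by the same amount.
shifted_scores = [s + 0.25 for s in baseline_scores]

for name, scores in [("baseline", baseline_scores), ("shifted", shifted_scores)]:
    precision, recall = precision_recall(scores, labels)
    print(f"{name}: AUC={auc(scores, labels):.3f} precision={precision:.2f} recall={recall:.2f}")

# Hypothetical feature rankings, most important first.
baseline_rank = ["loc", "churn", "complexity", "authors", "age"]
rebalanced_rank = ["churn", "age", "loc", "coupling", "authors"]
print("top-3 overlap:", top_k_overlap(baseline_rank, rebalanced_rank, 3))
```

In this toy case the shift raises Recall and lowers Precision at the 0.5 threshold while leaving AUC untouched, mirroring the direction of the reported effects.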


This review was generated by AI and reviewed by human editors.