QUICK REVIEW

[Paper Review] Ensemble of Example-Dependent Cost-Sensitive Decision Trees

Alejandro Correa Bahnsen, Djamila Aouada|arXiv (Cornell University)|May 18, 2015

Imbalanced Data Classification Techniques | References: 36 | Cited by: 25

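One-Sentence Summary

This paper proposes an ensemble framework of example-dependent cost-sensitive decision trees (ECSDT): multiple cost-sensitive trees are trained on random subsamples using bagging, pasting, random forests, or random patches, and are then combined by majority voting, cost-sensitive weighted voting, or cost-sensitive stacking to increase financial savings. On five real-world datasets the method outperforms state-of-the-art techniques, with random patches combined with cost-sensitive weighted voting performing best.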

ABSTRACT

Several real-world classification problems are example-dependent cost-sensitive in nature, where the costs due to misclassification vary between examples and not only within classes. However, standard classification methods do not take these costs into account, and assume a constant cost of misclassification errors. In previous works, some methods that take into account the financial costs into the training of different algorithms have been proposed, with the example-dependent cost-sensitive decision tree algorithm being the one that gives the highest savings. In this paper we propose a new framework of ensembles of example-dependent cost-sensitive decision-trees. The framework consists in creating different example-dependent cost-sensitive decision trees on random subsamples of the training set, and then combining them using three different combination approaches. Moreover, we propose two new cost-sensitive combination approaches; cost-sensitive weighted voting and cost-sensitive stacking, the latter being based on the cost-sensitive logistic regression method. Finally, using five different databases, from four real-world applications: credit card fraud detection, churn modeling, credit scoring and direct marketing, we evaluate the proposed method against state-of-the-art example-dependent cost-sensitive techniques, namely, cost-proportionate sampling, Bayes minimum risk and cost-sensitive decision trees. The results show that the proposed algorithms have better results for all databases, in the sense of higher savings.

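Research Motivation and Objectives

Address the limitation that traditional cost-sensitive classifiers assume a constant misclassification cost per class rather than an individual cost per example.
Overcome the high variance of a single cost-sensitive decision tree through ensemble learning.
Develop a framework that integrates example-dependent costs into both the training of the base learners and the combination of classifiers.
Show that financial savings, measured in real-world costs, is a better basis for model selection than traditional metrics such as the F1 score.
Evaluate the framework across diverse real-world applications, including credit card fraud detection, churn modeling, credit scoring, and direct marketing.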

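Proposed Method

Train multiple example-dependent cost-sensitive decision trees (ECSDT) on random subsamples of the training data using four training methods: bagging, pasting, random forests, and random patches.
Apply a cost-sensitive splitting criterion during tree construction and a cost-based pruning strategy to optimize the financial outcome.
Combine the base classifiers with one of three strategies: majority voting, cost-sensitive weighted voting (weights computed from cost-based performance), and cost-sensitive stacking (with cost-sensitive logistic regression as the meta-learner).
Use cost-proportionate sampling so that training examples are weighted by their individual misclassification costs.
Optimize ensemble performance by selecting the best pairing of training method and combination strategy across multiple datasets.
Evaluate performance with financial savings as the primary metric and the F1 score as a secondary, cost-insensitive baseline.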
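
The savings metric and the cost-sensitive weighted vote listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: savings is assumed to be measured relative to the cheaper of the two trivial classifiers (predict all negative or all positive), and each tree's voting weight is assumed to be proportional to its savings on held-out data. The cost tuple layout and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-example cost layout: (cost_tp, cost_fp, cost_fn, cost_tn).
Costs = Tuple[float, float, float, float]


def total_cost(y_true: Sequence[int], y_pred: Sequence[int],
               costs: Sequence[Costs]) -> float:
    """Sum the example-dependent cost incurred by a vector of predictions."""
    total = 0.0
    for y, p, (c_tp, c_fp, c_fn, c_tn) in zip(y_true, y_pred, costs):
        if y == 1:
            total += c_tp if p == 1 else c_fn
        else:
            total += c_fp if p == 1 else c_tn
    return total


def savings(y_true: Sequence[int], y_pred: Sequence[int],
            costs: Sequence[Costs]) -> float:
    """Relative cost reduction versus the cheaper trivial classifier (assumed definition)."""
    n = len(y_true)
    baseline = min(total_cost(y_true, [0] * n, costs),
                   total_cost(y_true, [1] * n, costs))
    if baseline == 0:
        return 0.0  # nothing to save when a trivial classifier is already free
    return (baseline - total_cost(y_true, y_pred, costs)) / baseline


def savings_weights(member_savings: Sequence[float]) -> List[float]:
    """Turn per-member savings into voting weights that sum to 1.

    Negative savings are clipped to zero (an assumption of this sketch).
    """
    clipped = [max(s, 0.0) for s in member_savings]
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(clipped)] * len(clipped)  # fall back to equal weights
    return [s / total for s in clipped]


def weighted_vote(member_preds: Sequence[Sequence[int]],
                  weights: Sequence[float]) -> List[int]:
    """Predict 1 when the weighted share of positive votes reaches 0.5."""
    n = len(member_preds[0])
    combined = []
    for i in range(n):
        score = sum(w * preds[i] for w, preds in zip(weights, member_preds))
        combined.append(1 if score >= 0.5 else 0)
    return combined
```

With equal weights, `weighted_vote` reduces to plain majority voting, so the two strategies differ only in how much influence a high-savings tree receives.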

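Experimental Results

Research Questions

RQ1: Do ensemble methods increase financial savings over single-tree models in example-dependent cost-sensitive classification?
RQ2: With example-dependent costs, which training method (bagging, pasting, random forests, or random patches) produces the most effective base classifiers?
RQ3: Which combination strategy (majority voting, cost-sensitive weighted voting, or cost-sensitive stacking) achieves the highest financial savings for ensemble predictions?
RQ4: How strongly do rankings by F1 score correlate with rankings by financial savings on real-world datasets?
RQ5: In business-critical applications with variable misclassification costs, to what extent do traditional, cost-insensitive metrics such as the F1 score mislead model selection?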

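Key Findings

On all five real-world datasets, the proposed ensemble framework achieves higher financial savings than state-of-the-art example-dependent cost-sensitive methods, including cost-proportionate sampling, Bayes minimum risk, and standard cost-sensitive decision trees.
Random patches is the best-performing training method, possibly because of its lower complexity and its effective use of diverse feature and sample subsets.
Cost-sensitive weighted voting is the best-performing combination strategy, ahead of majority voting and cost-sensitive stacking.
Rankings by F1 score and rankings by financial savings have a correlation of only 65.10%, indicating that traditional metrics can mislead model selection in cost-sensitive settings.
The algorithm with the highest savings is not always the model with the highest F1 score, confirming that business-oriented metrics are essential for real-world deployment.
The framework's 12 configurations (4 training methods × 3 combination strategies) show that the choice of training method and combination strategy has a significant effect on the financial outcome.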
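
To make the comparison between F1 rankings and savings rankings concrete, here is a minimal sketch of a rank correlation. It assumes Spearman's coefficient, since this summary does not say which correlation measure the paper used, and the formula shown is valid only when there are no tied scores. The function names and any scores passed in are hypothetical, not results from the paper.

```python
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank models from best (1) to worst; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation for two score lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

A value well below 1 means the two metrics order the models differently, which is the situation the findings above describe.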


This review was generated by AI and reviewed by a human editor.