QUICK REVIEW

[Paper Review] A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

Yan Wang, Xuelei Sherry Ni|arXiv (Cornell University)|Jan 24, 2019

Imbalanced Data Classification Techniques | References: 32 | Cited by: 23

One-Sentence Summary

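This paper proposes an XGBoost-based business risk classification model that combines feature selection (weight by Chi-square performs best for XGBoost, hierarchical variable clustering for logistic regression) with Bayesian hyper-parameter tuning via the tree-structured Parzen Estimator (TPE); the resulting model significantly outperforms logistic regression on accuracy, AUC, recall, and F1 score, varies less than its random-search-tuned counterpart, and is easier to interpret thanks to its feature importance ranking.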

ABSTRACT

This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effect of different FS and hyper-parameter optimization methods on the model performance are investigated by the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from the 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows a superiority over RS since it results in a significantly higher accuracy and a marginally higher AUC, recall and F1 score. Furthermore, XGBoost with TPE tuning shows a lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances the model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative while powerful approach for business risk modeling.
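
The abstract names weight by Chi-square as the best-performing feature selection method for XGBoost. The exact operator the authors used is not described here, so the following is only a minimal sketch of the general idea: score each discrete feature by the Pearson chi-square statistic of its contingency table with the label, then rank features by that score. The feature names and toy values are hypothetical, and continuous features would first need to be binned.

```python
from collections import Counter


def chi_square_weight(feature, labels):
    """Pearson chi-square statistic of the feature-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            # Expected cell count if the feature and the label were independent.
            expected = f_count * y_count / n
            stat += (observed[(f, y)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data (not from the paper): three discrete features
# and a binary risk label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "late_payments": [1, 1, 1, 0, 0, 0, 0, 0],
    "region": [0, 1, 0, 1, 0, 1, 0, 1],
    "has_lien": [1, 1, 1, 1, 0, 0, 0, 0],
}

weights = {name: chi_square_weight(col, labels) for name, col in features.items()}
# Higher weight means stronger association with the label.
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)
```

A selection step would then keep the top-ranked features; how many to keep is a tuning decision that this sketch leaves open.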

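Motivation and Objectives

Develop a robust XGBoost-based model for business risk classification.
Evaluate how several feature selection methods affect model performance.
Compare random search with Bayesian optimization (TPE) for tuning XGBoost hyper-parameters.
Benchmark XGBoost against traditional logistic regression using standard classification metrics.
Improve model interpretability through feature importance ranking.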

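Proposed Method

Apply five feature selection methods: weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information.
Tune hyper-parameters with two techniques: random search (RS) and Bayesian optimization with the tree-structured Parzen Estimator (TPE).
Train the XGBoost models under 10-fold cross-validation to obtain robust performance estimates.
Evaluate model performance by classification accuracy, AUC, recall, and F1 score.
Use the Wilcoxon Signed Rank Test to assess whether performance differences are statistically significant.
Rank features by XGBoost feature importance to improve model interpretability.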
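
The evaluation protocol above (10-fold cross-validation scored by accuracy, AUC, recall, and F1) can be sketched with the standard library alone. This is an illustration under assumptions rather than the authors' pipeline: the fold splitter is a plain shuffled split (whether the paper stratified its folds is not stated here), the 0.5 decision threshold is assumed, and the labels and scores are made-up toy values.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices once and deal them into k nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def classification_metrics(y_true, y_pred):
    """Accuracy, recall, and F1 for binary labels (1 = risky, 0 = not risky)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "recall": recall, "f1": f1}


def auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for one test fold.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

folds = k_fold_indices(25, k=10)
metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

In a full run, each fold in turn would serve as the test set while the model is fit on the other nine, giving ten values per metric for later comparison.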

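Experimental Results

Research Questions

RQ1: Which feature selection method performs best with XGBoost in business risk modeling?
RQ2: How do random search and Bayesian optimization (TPE) compare for tuning XGBoost hyper-parameters?
RQ3: Does XGBoost, after hyper-parameter optimization and feature selection, outperform logistic regression in risk classification?
RQ4: How much does model performance vary across combinations of optimization strategy and feature selection method?
RQ5: Can XGBoost provide an interpretable feature importance ranking for business risk decisions?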
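
For RQ2, the random search baseline can be sketched as follows. The parameter names mirror common XGBoost hyper-parameters, but the ranges, the trial budget, and the objective are all assumptions: `toy_cv_score` is a hypothetical stand-in for a cross-validated score, not a trained model.

```python
import math
import random

# Hypothetical search space; the ranges are assumptions, not the paper's settings.
SEARCH_SPACE = {
    "max_depth": lambda rng: rng.randint(2, 10),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform draw
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
}


def toy_cv_score(params):
    """Stand-in for a cross-validated score (higher is better, maximum 0)."""
    return (
        -((params["max_depth"] - 6) ** 2) / 16
        - (math.log10(params["learning_rate"]) + 1) ** 2
        - (params["subsample"] - 0.8) ** 2
    )


def random_search(objective, space, n_trials=50, seed=0):
    """Draw each configuration independently and keep the best one seen."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    history = []
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in space.items()}
        score = objective(params)
        history.append(score)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


best_params, best_score, history = random_search(toy_cv_score, SEARCH_SPACE)
print(best_params, best_score)
```

Random search never looks at earlier trials when drawing the next configuration. TPE differs on exactly this point: it uses the scores of past trials to decide where to sample next, and in practice it is usually run through a library such as Hyperopt rather than written by hand.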

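Key Findings

Hierarchical clustering is the best feature selection method for logistic regression, while weight by Chi-square performs best for XGBoost.
XGBoost tuned with either TPE or random search significantly outperforms logistic regression.
Compared with random search, TPE yields significantly higher accuracy and marginally higher AUC, recall, and F1 score.
XGBoost tuned with TPE shows lower performance variability than XGBoost tuned with random search.
The feature importance ranking produced by XGBoost improves model interpretability and supports practical risk assessment.
XGBoost with Bayesian TPE hyper-parameter optimization is a powerful, robust, and interpretable alternative to logistic regression for business risk modeling.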
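
The significance claims above rest on the Wilcoxon Signed Rank Test applied to paired scores. Below is a minimal standard-library sketch of that test, assuming the pairs are per-fold scores of two tuning methods. The fold accuracies are made-up illustrative numbers, not results from the paper, and the exact p-value is computed by enumerating every sign assignment, which is only practical for small samples such as ten folds.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** len(ranks)


# Hypothetical per-fold accuracies for two tuning methods (illustrative only).
tpe_scores = [0.920, 0.925, 0.915, 0.930, 0.930, 0.920, 0.910, 0.925, 0.925, 0.930]
rs_scores = [0.910, 0.905, 0.920, 0.900, 0.915, 0.895, 0.922, 0.890, 0.885, 0.912]

statistic, p_value = wilcoxon_signed_rank(tpe_scores, rs_scores)
print(statistic, p_value)
```

On these toy numbers the smaller rank sum is 4 and the exact two-sided p-value is 14/1024, about 0.014, so the difference would count as significant at the 0.05 level.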

This review was generated by AI and reviewed by human editors.