QUICK REVIEW

[Paper Review] A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

Yan Wang, Xuelei Sherry Ni|arXiv (Cornell University)|Jan 24, 2019

Imbalanced Data Classification Techniques|References: 32|Citations: 23

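One-line summary

This paper proposes an XGBoost-based business risk classification model that combines feature selection with Bayesian hyper-parameter optimization. Five feature selection methods are compared (weight by Chi-square works best for XGBoost, hierarchical variable clustering for logistic regression), and hyper-parameters are tuned with the tree-structured Parzen Estimator (TPE). The tuned XGBoost significantly outperforms logistic regression on accuracy, AUC, recall, and F1 score, shows lower variability than XGBoost tuned by random search, and is easier to interpret thanks to its feature importance ranking.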

ABSTRACT

This paper aims to explore models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimizations are simultaneously considered during model training. The five most commonly used FS methods including weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of different FS and hyper-parameter optimization methods on the model performance are investigated by the Wilcoxon Signed Rank Test. The performance of XGBoost is compared to the traditionally utilized logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from the 10-fold cross validation. Results show that hierarchical clustering is the optimal FS method for LR while weight by Chi-square achieves the best performance in XGBoost. Both TPE and RS optimization in XGBoost outperform LR significantly. TPE optimization shows a superiority over RS since it results in a significantly higher accuracy and a marginally higher AUC, recall and F1 score. Furthermore, XGBoost with TPE tuning shows a lower variability than the RS method. Finally, the ranking of feature importance based on XGBoost enhances the model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative while powerful approach for business risk modeling.
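
As a rough illustration of what "weight by Chi-square" in the abstract refers to, the sketch below scores each categorical feature by its Pearson chi-square statistic against the class label and ranks features by that score. The function name chi_square_weight, the feature names, and the toy data are hypothetical; the paper's actual tooling and data are not described in this review.

```python
from collections import Counter


def chi_square_weight(feature, label):
    """Pearson chi-square statistic between a categorical feature and a class label."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # joint counts per (feature value, label value)
    feature_totals = Counter(feature)         # marginal counts of the feature
    label_totals = Counter(label)             # marginal counts of the label
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data (not from the paper): 1 = risky business, 0 = not risky.
label = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "late_payments": [1, 1, 1, 0, 0, 0, 0, 1],  # mostly tracks the label
    "has_website": [1, 0, 1, 0, 1, 0, 1, 0],    # independent of the label
}

weights = {name: chi_square_weight(column, label) for name, column in features.items()}
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)  # features ordered from most to least associated with the label
```

A higher statistic means a stronger association with the label, so a weight-based selector would keep the top-ranked features and drop the rest.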

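Motivation and objectives

Develop a robust XGBoost-based risk model for business classification.
Evaluate how several feature selection methods affect model performance.
Compare random search with Bayesian optimization (TPE) for hyper-parameter tuning in XGBoost.
Benchmark XGBoost against conventional logistic regression on standard classification metrics.
Improve model interpretability through feature importance ranking.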

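Proposed method

Applies five feature selection methods: weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information.
Uses two hyper-parameter optimization methods: random search (RS) and Bayesian optimization with the tree-structured Parzen Estimator (TPE).
Trains the XGBoost models with 10-fold cross validation to obtain robust performance estimates.
Evaluates model performance by classification accuracy, AUC, recall, and F1 score.
Uses the Wilcoxon Signed Rank Test to assess the statistical significance of performance differences.
Ranks features by XGBoost feature importance to improve model interpretability.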
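
The difference between the two tuning strategies listed above can be sketched in a few lines. The following is a deliberately simplified, one-dimensional TPE with a fixed kernel bandwidth, written only with the standard library; it is not the implementation used in the paper or in any tuning library. The objective function is a hypothetical stand-in for a cross-validated loss, and all parameter names and defaults (n_startup, gamma, n_candidates, bandwidth) are assumptions made for illustration.

```python
import math
import random


def objective(x):
    """Hypothetical stand-in for a cross-validated loss over one hyper-parameter."""
    return (x - 0.3) ** 2


def random_search(n_trials, rng, low=0.0, high=1.0):
    """Random search: sample uniformly and keep the trial with the lowest loss."""
    trials = [(x, objective(x)) for x in (rng.uniform(low, high) for _ in range(n_trials))]
    return min(trials, key=lambda t: t[1])


def kde(points, x, bandwidth):
    """Gaussian Parzen-window density estimate at x (0.0 if there are no points)."""
    if not points:
        return 0.0
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def tpe_search(n_trials, rng, low=0.0, high=1.0,
               n_startup=10, gamma=0.25, n_candidates=24, bandwidth=0.1):
    """Simplified 1-D TPE: model good and bad trials separately, then pick the
    candidate that maximizes the density ratio l(x) / g(x)."""
    # Start with uniformly random trials, as random search would.
    trials = [(x, objective(x)) for x in (rng.uniform(low, high) for _ in range(n_startup))]
    while len(trials) < n_trials:
        trials.sort(key=lambda t: t[1])
        n_good = max(1, int(gamma * len(trials)))
        good = [x for x, _ in trials[:n_good]]  # best gamma fraction, modeled by l(x)
        bad = [x for x, _ in trials[n_good:]]   # remaining trials, modeled by g(x)
        # Draw candidates from l(x): pick a good point, add Gaussian noise, clip to range.
        candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                      for _ in range(n_candidates)]
        best = max(candidates,
                   key=lambda c: kde(good, c, bandwidth) / (kde(bad, c, bandwidth) + 1e-12))
        trials.append((best, objective(best)))
    return min(trials, key=lambda t: t[1])


rs_x, rs_loss = random_search(40, random.Random(0))
tpe_x, tpe_loss = tpe_search(40, random.Random(0))
print(f"random search: x={rs_x:.3f} loss={rs_loss:.5f}")
print(f"simplified TPE: x={tpe_x:.3f} loss={tpe_loss:.5f}")
```

Random search ignores past trials, while the TPE loop uses them to concentrate new trials where good results are dense and poor results are sparse. On a toy objective like this one the two may end up close; the sketch shows the mechanism, not the performance gap reported in the paper.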

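Experimental results

Research questions

RQ1: Which feature selection method gives the best XGBoost performance in business risk modeling?
RQ2: How do random search and Bayesian optimization (TPE) compare for hyper-parameter tuning in XGBoost?
RQ3: Does XGBoost with optimized hyper-parameters and feature selection outperform logistic regression in risk classification?
RQ4: How much does model performance vary across the different optimization and feature selection strategies?
RQ5: Can XGBoost provide an interpretable feature importance ranking for business risk decisions?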

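Key findings

Hierarchical clustering was the best feature selection method for logistic regression, whereas weight by Chi-square gave the best results for XGBoost.
XGBoost tuned with either TPE or random search significantly outperformed logistic regression on all metrics.
Compared with random search, TPE optimization achieved significantly higher accuracy and marginally higher AUC, recall, and F1 score.
XGBoost with TPE tuning showed lower performance variability than with random search.
The XGBoost feature importance ranking improved model interpretability and supported practical risk assessment.
XGBoost with Bayesian TPE hyper-parameter optimization is a powerful, robust, and interpretable alternative to logistic regression for business risk modeling.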
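
The significance claims above rest on the Wilcoxon Signed Rank Test applied to paired results. A minimal standard-library sketch of an exact two-sided version for small samples follows. The per-fold numbers are hypothetical counts of correctly classified cases out of 100 per fold, chosen for illustration and not taken from the paper; the function name wilcoxon_signed_rank is likewise an assumption.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (small n only)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ties share the average of their ranks
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    # Exact null distribution: every rank is positive or negative with probability 1/2.
    rank_sum = sum(ranks)
    extreme = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, rank_sum - plus) <= statistic:
            extreme += 1
    return statistic, extreme / total


# Hypothetical per-fold results from 10-fold cross validation (not from the paper).
tpe_correct = [88, 88, 88, 88, 90, 89, 91, 90, 92, 91]
rs_correct = [86, 85, 89, 84, 85, 83, 84, 82, 83, 81]

statistic, p_value = wilcoxon_signed_rank(tpe_correct, rs_correct)
print(statistic, p_value)
```

The enumeration visits 2^n sign patterns, which is fine for 10 folds but not for large samples, where a normal approximation is the usual choice.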

This review was generated by AI and checked by a human editor.