QUICK REVIEW

[Paper Digest] Enhancing Robustness of Gradient-Boosted Decision Trees through One-Hot Encoding and Regularization

Shijie Cui, Agus Sudjianto|arXiv (Cornell University)|Apr 26, 2023

Advanced Statistical Methods and Models | Cited by 9

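One-Sentence Summary

The paper one-hot encodes the leaves of a GBDT to recast it as a linear model, then refits that model with L1 or L2 regularization to improve robustness to covariate perturbations, with supporting theory and experiments.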

ABSTRACT

Gradient-boosted decision trees (GBDT) are a widely used and highly effective machine learning approach for tabular data modeling. However, their complex structure may lead to low robustness against small covariate perturbations in unseen data. In this study, we apply one-hot encoding to convert a GBDT model into a linear framework, through encoding of each tree leaf to one dummy variable. This allows for the use of linear regression techniques, plus a novel risk decomposition for assessing the robustness of a GBDT model against covariate perturbations. We propose to enhance the robustness of GBDT models by refitting their linear regression forms with $L_1$ or $L_2$ regularization. Theoretical results are obtained about the effect of regularization on the model performance and robustness. It is demonstrated through numerical experiments that the proposed regularization approach can enhance the robustness of the one-hot-encoded GBDT models.

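Research Motivation and Objectives

Motivate robustness assessment for GBDT and identify its vulnerability under covariate perturbations.
Propose a one-hot encoding framework (GBDT OHE) that represents a GBDT as a linear model.
Refit the one-hot-encoded form with L1 or L2 regularization to improve robustness.
Develop a risk decomposition that analyzes perturbation effects in GBDT and quantifies robustness.
Provide theoretical results and numerical evidence that regularization improves robustness in GBDT OHE.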

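Proposed Method

Represent the GBDT as a linear model (GBDT OHE) by encoding each tree leaf as one dummy variable.
Write the model as $F_M(x) = \sum_k b_k \phi_k(x) = \Phi(x)^\top \beta$, so that it can be fitted by linear regression.
Introduce a perturbation term $\Delta\Phi$ and decompose the risk into bias, variance, and perturbation components.
After one-hot encoding, apply L1 (Lasso) or L2 (Ridge) regularization to the leaf coefficients to control the inflation of variance in high dimensions.
Provide a theoretical link showing the robustness gain under robust regression and regularization (Theorem 1).
Run numerical experiments on real datasets (Airfoil, CHP, and others), comparing an XGBoost baseline with OHE plus regularization.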
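
The leaf-encoding idea in the list above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the two hand-written stumps in `TREES`, the synthetic data, and the helper names (`leaf_index`, `one_hot_leaves`, `ridge_refit`) are hypothetical, and plain gradient descent stands in for whatever ridge solver the authors used. It shows only that $\Phi(x)^\top \beta$ reproduces the ensemble prediction and that a larger L2 penalty shrinks the refitted leaf coefficients.

```python
import random

# Hypothetical toy ensemble: two regression stumps on a 1-D input.
# Each tree is (threshold, left_leaf_value, right_leaf_value).
TREES = [(0.5, -1.0, 1.0), (0.2, -0.3, 0.4)]


def leaf_index(tree, x):
    """Return 0 if x falls in the left leaf of the stump, 1 otherwise."""
    return 0 if x < tree[0] else 1


def gbdt_predict(x):
    """Usual additive prediction: sum of the leaf value reached in each tree."""
    return sum(tree[1 + leaf_index(tree, x)] for tree in TREES)


def one_hot_leaves(x):
    """Phi(x): one dummy variable per leaf, concatenated over all trees."""
    phi = []
    for tree in TREES:
        idx = leaf_index(tree, x)
        phi.extend(1.0 if idx == j else 0.0 for j in range(2))
    return phi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return sum(a * a for a in v) ** 0.5


# beta stacks the original leaf values in the same order as Phi(x),
# so Phi(x)^T beta equals the ensemble prediction.
BETA = [value for tree in TREES for value in tree[1:]]


def ridge_refit(xs, ys, lam, steps=2000, lr=0.1):
    """Refit the leaf coefficients by gradient descent on
    (1/2n) * ||Phi beta - y||^2 + (lam/2) * ||beta||^2."""
    rows = [one_hot_leaves(x) for x in xs]
    beta = list(BETA)  # warm start from the original leaf values
    n = len(xs)
    for _ in range(steps):
        residuals = [dot(row, beta) - y for row, y in zip(rows, ys)]
        grad = [
            sum(r * row[k] for r, row in zip(residuals, rows)) / n + lam * beta[k]
            for k in range(len(beta))
        ]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta


# Synthetic training data: ensemble output plus Gaussian noise (assumed setup).
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [gbdt_predict(x) + rng.gauss(0.0, 0.1) for x in xs]

beta_small = ridge_refit(xs, ys, lam=0.01)  # light regularization
beta_large = ridge_refit(xs, ys, lam=1.0)   # heavy regularization
print("||beta|| with lam=0.01:", round(norm(beta_small), 3))
print("||beta|| with lam=1.0: ", round(norm(beta_large), 3))
```

With a fitted model, the leaf indices would come from the library rather than from hand-written stumps; scikit-learn's `GradientBoostingRegressor.apply`, for instance, returns the leaf index reached in each tree, which can then be one-hot encoded.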

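Experimental Results

Research Questions

RQ1: How robust are conventional GBDT models to small perturbations in unseen data?
RQ2: Does one-hot encoding of GBDT leaves bring robustness analysis into a linear framework?
RQ3: Does refitting OHE_GBDT with L1 or L2 regularization improve robustness without a significant loss of performance?
RQ4: How does the strength of regularization affect the bias, variance, and perturbation terms in GBDT OHE?
RQ5: Under covariate perturbations, do regularized GBDT OHE models outperform standard XGBoost?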

Key Findings

Model	Airfoil(0%)	Airfoil(2%)	Airfoil(5%)	CHP(0%)	CHP(2%)	CHP(5%)	BS(0%)	BS(5%)	BS(10%)
XGB	0.032/0	0.074/0.046	0.156/0.134	0.154/0	0.202/0.058	0.324/0.180	0.159/0	0.215/0.059	0.349/0.198
XGB_reg	0.033/0	0.071/0.039	0.153/0.124	0.155/0	0.201/0.053	0.316/0.168	0.160/0	0.212/0.057	0.343/0.196
OHE_Ridge_s	0.020/0	0.053/0.032	0.120/0.099	0.151/0	0.199/0.057	0.318/0.173	0.158/0	0.212/0.057	0.345/0.197
OHE_Ridge_m	0.021/0	0.052/0.029	0.119/0.092	0.155/0	0.194/0.039	0.295/0.131	0.155/0	0.194/0.039	0.343/0.187
OHE_Ridge_l	0.029/0	0.054/0.025	0.117/0.083	0.170/0	0.201/0.028	0.287/0.102	0.161/0	0.213/0.051	0.342/0.181
OHE_Lasso_s	0.022/0	0.058/0.036	0.125/0.100	0.151/0	0.205/0.062	0.331/0.186	0.159/0	0.213/0.059	0.349/0.200
OHE_Lasso_m	0.025/0	0.055/0.033	0.121/0.098	0.153/0	0.201/0.053	0.317/0.164	0.158/0	0.212/0.056	0.346/0.196
OHE_Lasso_l	0.026/0	0.056/0.031	0.120/0.096	0.179/0	0.211/0.039	0.305/0.125	0.159/0	0.213/0.055	0.346/0.193
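
To make the table easier to compare, the arithmetic below computes how much smaller the second number of each `a/b` cell is relative to the XGB row at the 5% perturbation level. The digest does not label the two numbers in a cell. The second is 0 whenever the perturbation is 0%, which suggests a perturbation-related component, but that reading is an assumption. The values are copied from the table above, and the helper `relative_reduction` is a hypothetical name.

```python
# Second number of each "a/b" cell, copied from the Airfoil(5%) and CHP(5%)
# columns of the table above. What the number measures is not stated there.
second = {
    "XGB":         {"Airfoil(5%)": 0.134, "CHP(5%)": 0.180},
    "XGB_reg":     {"Airfoil(5%)": 0.124, "CHP(5%)": 0.168},
    "OHE_Ridge_l": {"Airfoil(5%)": 0.083, "CHP(5%)": 0.102},
    "OHE_Lasso_l": {"Airfoil(5%)": 0.096, "CHP(5%)": 0.125},
}


def relative_reduction(model, column, baseline="XGB"):
    """Fractional decrease of the cell's second number versus the baseline row."""
    base = second[baseline][column]
    return (base - second[model][column]) / base


for model in ("XGB_reg", "OHE_Ridge_l", "OHE_Lasso_l"):
    for column in ("Airfoil(5%)", "CHP(5%)"):
        print(f"{model:12s} {column:12s} {relative_reduction(model, column):.1%}")
```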

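As boosting complexity increases, GBDT models can become less robust, as shown by the perturbation-sensitive risk decomposition.
One-hot encoding the GBDT leaves yields a linear representation (GBDT OHE) and enables a new risk decomposition for robustness.
Regularizing the refitted linear form with an L1 or L2 penalty reduces the perturbation term and improves robustness, subject to a bias/variance trade-off.
Numerically, without perturbation, GBDT OHE with light regularization often matches or beats the baseline; with perturbation, robustness improves.
Stronger regularization generally reduces the perturbation effect and improves robustness to larger perturbations in unseen data, possibly at the cost of higher bias.
Compared with the regularized XGBoost baseline (XGB_reg), regularized GBDT OHE is generally more robust under data perturbation.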
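
A short worked bound gives some intuition for why shrinking the leaf coefficients limits the effect of a perturbation. It is offered as an illustration and is not the paper's risk decomposition or its Theorem 1. The notation is assumed here: $\delta$ is a covariate perturbation, $\hat\beta$ is the refitted coefficient vector, $\hat F(x) = \Phi(x)^\top \hat\beta$, and $M$ is the number of trees. Within each tree, $x$ and $x+\delta$ each activate exactly one leaf, so that tree's block of $\Delta\Phi$ is either all zeros or contains a single $+1$ and a single $-1$.

```latex
\begin{aligned}
\hat F(x+\delta) - \hat F(x)
  &= \bigl(\Phi(x+\delta) - \Phi(x)\bigr)^\top \hat\beta
   = \Delta\Phi^\top \hat\beta, \\
\bigl|\Delta\Phi^\top \hat\beta\bigr|
  &\le \|\Delta\Phi\|_2 \,\|\hat\beta\|_2
   \le \sqrt{2M}\,\|\hat\beta\|_2, \\
\bigl|\Delta\Phi^\top \hat\beta\bigr|
  &\le \|\Delta\Phi\|_\infty \,\|\hat\beta\|_1
   \le \|\hat\beta\|_1 .
\end{aligned}
```

The second line uses the Cauchy-Schwarz inequality and the third uses Hölder's inequality. Under these assumptions, the change in prediction is controlled by the $L_2$ or $L_1$ norm of the coefficients, which are the quantities that Ridge and Lasso penalize.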

This digest was generated by AI and reviewed by a human editor.