QUICK REVIEW

[Paper Review] Comparing interpretability and explainability for feature selection

Jack Dunn, Luca Mingardi | arXiv (Cornell University) | May 11, 2021

Explainable Artificial Intelligence (XAI) | References: 9 | Cited by: 23

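One-Sentence Summary

The paper evaluates how well variable importance works for feature selection in interpretable models (CART, Optimal Trees) and in black-box approaches (XGBoost, scored either with its native importance or with SHAP). It finds that the interpretable models, Optimal Trees in particular, identify irrelevant features more accurately and show no bias toward features with more unique values, whereas XGBoost and SHAP consistently misallocate importance and therefore perform poorly at feature selection despite their strong predictive accuracy.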

ABSTRACT

A common approach for feature selection is to examine the variable importance scores for a machine learning model, as a way to understand which features are the most relevant for making predictions. Given the significance of feature selection, it is crucial for the calculated importance scores to reflect reality. Falsely overestimating the importance of irrelevant features can lead to false discoveries, while underestimating the importance of relevant features may lead us to discard important features, resulting in poor model performance. Additionally, black-box models like XGBoost provide state-of-the-art predictive performance, but cannot be easily understood by humans, and thus we rely on variable importance scores or methods for explainability like SHAP to offer insight into their behavior. In this paper, we investigate the performance of variable importance as a feature selection method across various black-box and interpretable machine learning methods. We compare the ability of CART, Optimal Trees, XGBoost and SHAP to correctly identify the relevant subset of variables across a number of experiments. The results show that regardless of whether we use the native variable importance method or SHAP, XGBoost fails to clearly distinguish between relevant and irrelevant features. On the other hand, the interpretable methods are able to correctly and efficiently identify irrelevant features, and thus offer significantly better performance for feature selection.

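Motivation and Objectives

Evaluate how reliable variable importance scores are as a feature selection tool across different machine learning models.
Investigate whether black-box models such as XGBoost, and explainability methods such as SHAP, accurately reflect the true relevance of features.
Assess whether interpretable models such as CART and Optimal Trees exhibit selection bias toward features with more unique values.
Determine whether Optimal Trees, as a globally optimized alternative to greedy trees, improve feature selection accuracy and reduce bias.
Compare how quickly and how accurately each model identifies feature importance under different data sizes and feature distributions.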

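Proposed Method

The study uses synthetic datasets with controlled feature distributions, including features with 2, 4, 10, and 20 unique values, to induce selection bias.
Ground-truth trees are generated that split on only three of the features, so feature relevance can be evaluated exactly.
Variable importance is computed with the native methods of CART, Optimal Trees, and XGBoost, as well as with SHAP, and the results are aggregated over multiple runs.
Performance is evaluated by measuring the share of importance assigned to irrelevant features as the training set grows.
Out-of-sample accuracy is reported to confirm that good feature selection performance does not come at the expense of predictive accuracy.
Experiments are run in both an unbiased setting (uniformly generated features) and a biased setting (features rounded so that they have different numbers of unique values) to test robustness to selection bias.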
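
The synthetic setup and the evaluation metric described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the column layout `UNIQUE_VALUES`, the choice of relevant features `RELEVANT`, the thresholds, and the depth-2 tree in `true_label` are all hypothetical, and only the general idea (features rounded to a fixed number of levels, a ground-truth tree that uses three features, and the share of importance landing on irrelevant features) comes from the summary.

```python
import random

# Assumed layout (hypothetical): number of unique values per feature column.
UNIQUE_VALUES = [2, 4, 10, 20, 2, 4, 10, 20]
# Assumed: the ground-truth tree splits on these three features only.
RELEVANT = (0, 1, 2)


def make_row(rng, biased=True):
    """Draw one feature vector with values in [0, 1)."""
    row = []
    for k in UNIQUE_VALUES:
        u = rng.random()
        # Biased setting: snap to k levels so the column has at most k unique values.
        # Unbiased setting: keep the continuous uniform draw.
        row.append(int(u * k) / k if biased else u)
    return row


def true_label(row):
    """Hypothetical depth-2 ground-truth tree that uses only features 0, 1, 2."""
    if row[0] < 0.5:
        return 1 if row[1] < 0.5 else 0
    return 1 if row[2] < 0.5 else 0


def make_dataset(n, seed=0, biased=True):
    """Generate n labelled rows from the hypothetical ground-truth tree."""
    rng = random.Random(seed)
    X = [make_row(rng, biased) for _ in range(n)]
    y = [true_label(row) for row in X]
    return X, y


def irrelevant_share(importances, relevant=RELEVANT):
    """Fraction of total importance placed on features the true tree never uses."""
    total = sum(importances)
    if total == 0:
        return 0.0
    irrelevant = sum(v for j, v in enumerate(importances) if j not in relevant)
    return irrelevant / total
```

Any model's importance vector can be passed to `irrelevant_share`; for example, `irrelevant_share([1, 1, 1, 1, 0, 0, 0, 0])` gives 0.25 because one of the four equally weighted features is irrelevant. A perfect feature selector would drive this value to zero as the training set grows.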

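Experimental Results

Research Questions

RQ1: Do the variable importance scores from XGBoost and SHAP accurately reflect the true relevance of features in the synthetic datasets?
RQ2: How does selection bias toward features with more unique values affect variable importance in CART and XGBoost?
RQ3: Can interpretable models such as Optimal Trees outperform black-box models at identifying irrelevant features?
RQ4: How quickly do the different models converge to the correct allocation of feature importance as training data increases?
RQ5: Does the global optimization used in Optimal Trees reduce selection bias compared with greedy CART?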

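Key Findings

Optimal Trees (OCT) consistently assign near-zero importance to irrelevant features, even at small sample sizes, and converge faster than the other models.
XGBoost and SHAP fail to separate relevant from irrelevant features and assign substantial importance to irrelevant ones, particularly in high-noise settings.
CART shows higher variability and slower convergence in the biased setting, indicating that it is susceptible to selection bias driven by the number of unique values.
Despite its stronger predictive accuracy, XGBoost's variable importance scores are unreliable for feature selection because they wrongly assign importance to irrelevant features.
Although SHAP can in theory mitigate bias, it still fails to correctly identify irrelevant features, which limits its practical usefulness for feature selection.
Optimal Trees deliver predictive performance comparable to XGBoost while achieving better feature selection accuracy with less bias.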
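
The selection bias referred to in these findings can be illustrated with a toy simulation. This is not an experiment from the paper: it is a hedged sketch with hypothetical function names, sample sizes, and a Gini-based split criterion chosen for the illustration. Both features are pure noise, so neither should be preferred, yet a greedy search that takes the best threshold per feature gets many more attempts on the 20-level feature than on the 2-level one.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_gain(xs, ys):
    """Largest impurity reduction over all thresholds between observed values."""
    n = len(ys)
    parent = gini(ys)
    best = 0.0
    # Each distinct value except the smallest defines one candidate split x < t.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, gain)
    return best


def many_valued_win_rate(trials=200, n=100, seed=0):
    """How often a 20-level noise feature beats a 2-level noise feature."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        ys = [rng.randint(0, 1) for _ in range(n)]    # labels unrelated to features
        x2 = [rng.randrange(2) for _ in range(n)]     # 1 candidate split
        x20 = [rng.randrange(20) for _ in range(n)]   # up to 19 candidate splits
        if best_gain(x20, ys) > best_gain(x2, ys):
            wins += 1
    return wins / trials
```

Calling `many_valued_win_rate()` returns the fraction of trials in which the greedy criterion prefers the many-valued noise feature. A value well above one half indicates the bias; a criterion with no such bias would be expected to sit near one half.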


This review was generated by AI and reviewed by human editors.