QUICK REVIEW

[Paper Review] Symbolic regression outperforms other models for small data sets

Casper Wilstrup, Jaan Kasak|arXiv (Cornell University)|Mar 28, 2021

Explainable Artificial Intelligence (XAI) | References: 22 | Cited by: 27

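One-Sentence Summary

This study shows that, with small training sets of 250 observations, QLattice-based symbolic regression generalises better to out-of-sample data than linear models, decision trees, random forests, and gradient boosting. It outperforms every other method in 132 of 240 cases while remaining interpretable.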

ABSTRACT

Machine learning is often applied in health science to obtain predictions and new understandings of complex phenomena and relationships, but an availability of sufficient data for model training is a widespread problem. Traditional machine learning techniques, such as random forests and gradient boosting, tend to overfit when working with data sets of only a few hundred observations. This study demonstrates that for small training sets of 250 observations, symbolic regression generalises better to out-of-sample data than traditional machine learning frameworks, as measured by the coefficient of determination R2 on the validation set. In 132 out of 240 cases, symbolic regression achieves a higher R2 than any of the other models on the out-of-sample data. Furthermore, symbolic regression also preserves the interpretability of linear models and decision trees, an added benefit to its superior generalisation. The second best algorithm was found to be a random forest, which performs best in 37 of the 240 cases. When restricting the comparison to interpretable models, symbolic regression performs best in 184 out of 240 cases.

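Motivation and Objectives

Highlight the challenge of modelling with small data sets in health science.
Evaluate how well symbolic regression generalises from small training sets compared with traditional models.
Assess the interpretability trade-off between symbolic regression and the other methods.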

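Proposed Method

Compare QLattice symbolic regression with linear regression, decision trees, random forests, and gradient boosting, training on 250 samples and evaluating out of sample on 48 PMLB regression data sets.
Use R^2 on the out-of-sample validation set as the main measure of generalisation.
Draw 5 different training subsets of 250 observations from each data set to assess how robust the results are to the data split.
Configure the models with the typical hyperparameters listed in Table 1, including two QLattice criteria (AIC, BIC) and a max_edges constraint.
Report first-place counts and weighted scores over the 240 cases (48 data sets × 5 training subsets).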
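The protocol above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data is a synthetic stand-in produced by `make_regression` rather than a PMLB data set, QLattice is left out, only a few of the listed configurations are included, and using all rows outside the 250-row training subset as the out-of-sample set is an assumption. The helper name `evaluate_dataset` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

TRAIN_SIZE = 250  # training observations per case, as in the study
N_SUBSETS = 5     # training subsets drawn per data set, as in the study

# A few configurations from the results table; QLattice is omitted here.
MODELS = {
    "LinearRegression()": lambda: LinearRegression(),
    "Lasso(alpha=0.1)": lambda: Lasso(alpha=0.1, max_iter=100000),
    "DecisionTreeRegressor(max_depth=2)": lambda: DecisionTreeRegressor(
        max_depth=2, random_state=0
    ),
    "RandomForestRegressor(n_estimators=50)": lambda: RandomForestRegressor(
        n_estimators=50, random_state=0
    ),
    "GradientBoostingRegressor(n_estimators=50)": lambda: GradientBoostingRegressor(
        n_estimators=50, random_state=0
    ),
}


def evaluate_dataset(X, y, seed=0):
    """Return {model name: [out-of-sample R^2 for each training subset]}."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in MODELS}
    for _ in range(N_SUBSETS):
        # Shuffle the rows, train on the first 250, evaluate on the rest.
        idx = rng.permutation(len(y))
        train, test = idx[:TRAIN_SIZE], idx[TRAIN_SIZE:]
        for name, make_model in MODELS.items():
            model = make_model().fit(X[train], y[train])
            scores[name].append(r2_score(y[test], model.predict(X[test])))
    return scores


# Synthetic stand-in for one data set (assumption; not a PMLB data set).
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
results = evaluate_dataset(X, y)
for name, r2s in results.items():
    print(f"{name}: mean out-of-sample R^2 = {np.mean(r2s):.3f}")
```

Running the same function over 48 data sets would give the 240 cases (48 × 5) on which the models are ranked.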

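Experimental Results

Research Questions

RQ1: When training data is scarce, does symbolic regression generalise better to out-of-sample data?
RQ2: In the small-data setting, how does the interpretability of symbolic regression compare with that of linear models and decision trees?

Key Findings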

Model	First places	Weighted score	First places (best configuration per technique)	Weighted score (best configuration per technique)
QLattice(criterion="bic", max_edges=11)	77	644	132	1033
QLattice(criterion="aic", max_edges=11)	65	608
Lasso(alpha=0.1, max_iter=100000)	18	404	32	511
GradientBoostingRegressor(n_estimators=400)	12	375	36	821
RandomForestRegressor(n_estimators=400)	10	268	37	787
LinearRegression()	9	170
GradientBoostingRegressor(n_estimators=50)	8	166
GradientBoostingRegressor(n_estimators=200)	7	160
GradientBoostingRegressor(n_estimators=100)	7	158
Lasso(alpha=0.01, max_iter=100000)	7	133
RandomForestRegressor(n_estimators=50)	5	128
RandomForestRegressor()	5	124
RandomForestRegressor(n_estimators=200)	4	124
DecisionTreeRegressor(max_depth=2)	3	88	3	448
Lasso(alpha=0.05, max_iter=100000)	2	25
DecisionTreeRegressor(max_depth=1)	1	20
DecisionTreeRegressor(max_depth=6)	0	4
DecisionTreeRegressor(max_depth=4)	0	1

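When only the best configuration of each technique is compared, symbolic regression (QLattice) beats every other model in 132 of 240 cases.
Across all 18 configurations, QLattice with the BIC criterion takes the most first places (77) and the highest weighted score (644).
Among the five best-per-technique configurations, QLattice (BIC) also has the highest weighted score, 1033.
Random forest comes second by first places (37 of 240) and gradient boosting is close behind (36) with a higher weighted score (821 vs. 787), but both trail symbolic regression in out-of-sample generalisation.
Among interpretable models, symbolic regression is best in 184 of 240 cases (Lasso in 49, simple decision trees in 7).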
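Within a model family, simpler configurations often generalise better on these small data sets (for example, a decision tree with max_depth=2 takes more first places than trees with max_depth=4 or 6), and symbolic regression strikes a balance between learning capacity and generalisation.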
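The first-place counts quoted above amount to tallying, for each case, which model has the highest out-of-sample R^2. The sketch below shows that tally on toy scores and then checks the arithmetic of the reported best-configuration counts. The function name `count_first_places` and the toy values are illustrative assumptions, and ties are broken in favour of the earliest column, which the summary does not specify.

```python
import numpy as np


def count_first_places(r2, model_names):
    """Count, per model, the cases in which it has the highest R^2.

    r2 has shape (n_cases, n_models); ties go to the earliest column.
    """
    winners = np.argmax(r2, axis=1)
    counts = np.bincount(winners, minlength=len(model_names))
    return dict(zip(model_names, counts.tolist()))


# Toy scores for 4 cases and 3 models (illustrative values, not from the paper).
names = ["symbolic", "lasso", "forest"]
toy_r2 = np.array([
    [0.81, 0.75, 0.78],
    [0.40, 0.55, 0.52],
    [0.92, 0.90, 0.91],
    [0.10, 0.05, 0.30],
])
toy_counts = count_first_places(toy_r2, names)
print(toy_counts)

# First-place counts reported for the best configuration of each technique.
reported = {
    "QLattice": 132,
    "RandomForest": 37,
    "GradientBoosting": 36,
    "Lasso": 32,
    "DecisionTree": 3,
}
total = sum(reported.values())
shares = {name: n / total for name, n in reported.items()}
print(total, round(shares["QLattice"], 2))
```

The reported counts add up to the 240 cases, so QLattice wins 55% of them. The weighted score is not reproduced here because the summary does not define how it is computed.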


This review was generated by AI and checked by a human editor.