[Paper Review] Symbolic regression outperforms other models for small data sets
The study shows that, with small training sets of 250 observations, QLattice-based symbolic regression generalises better to out-of-sample data than linear models, decision trees, random forests, and gradient boosting: it outperformed all other models in 132 of 240 cases. It also remains interpretable.
Machine learning is often applied in health science to obtain predictions and new understandings of complex phenomena and relationships, but the availability of sufficient data for model training is a widespread problem. Traditional machine learning techniques, such as random forests and gradient boosting, tend to overfit when working with data sets of only a few hundred observations. This study demonstrates that for small training sets of 250 observations, symbolic regression generalises better to out-of-sample data than traditional machine learning frameworks, as measured by the coefficient of determination R^2 on the validation set. In 132 out of 240 cases, symbolic regression achieves a higher R^2 than any of the other models on the out-of-sample data. Furthermore, symbolic regression also preserves the interpretability of linear models and decision trees, an added benefit to its superior generalisation. The second best algorithm was found to be a random forest, which performs best in 37 of the 240 cases. When restricting the comparison to interpretable models, symbolic regression performs best in 184 out of 240 cases.
Motivation & Objective
- Highlight the challenge of modelling with small data sets in the health sciences.
- Evaluate the generalisation performance of symbolic regression versus traditional models on small training sets.
- Assess interpretability trade-offs between symbolic regression and other methods.
Proposed method
- Compare QLattice symbolic regression to linear regression, decision trees, random forests, and gradient boosting using 250-sample training and out-of-sample evaluation on 48 PMLB regression data sets.
- Use R^2 on the out-of-sample validation set as the primary generalisation metric.
- Sample 5 different 250-observation training sets per data set to assess robustness across data splits.
- Configure models with typical hyperparameters as listed in Table 1, including two QLattice criteria (AIC and BIC) and a max_edges constraint.
- Report first-place counts and weighted scores across 240 cases (48 data sets, 5 training samples each).
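The protocol above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names (`r_squared`, `split_indices`, `count_first_places`) and the toy scores are assumptions made for this example, and model fitting is left out entirely. It shows only how a 250-observation training sample is drawn, how R^2 is computed on held-out data, and how first places are tallied per case.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def split_indices(n_rows, n_train=250, seed=0):
    """Draw n_train row indices for training; the remaining rows are held out."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return idx[:n_train], idx[n_train:]


def count_first_places(scores_per_case):
    """Tally which model has the highest out-of-sample R^2 in each case.

    scores_per_case: list of {model_name: r2} dicts, one dict per case.
    Ties go to the model listed first in the dict.
    """
    wins = {}
    for scores in scores_per_case:
        best = max(scores, key=scores.get)
        wins[best] = wins.get(best, 0) + 1
    return wins


# Toy scores for three hypothetical cases (illustrative values, not from the paper).
toy_cases = [
    {"symbolic regression": 0.71, "random forest": 0.64, "linear": 0.58},
    {"symbolic regression": 0.42, "random forest": 0.47, "linear": 0.40},
    {"symbolic regression": 0.80, "random forest": 0.75, "linear": 0.79},
]
first_places = count_first_places(toy_cases)
```

In the study's setting, each of the 48 data sets would be split five times with different seeds, giving the 240 cases over which first places are counted.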
Experimental results
Research questions
- RQ1: Does symbolic regression generalise better to out-of-sample data than conventional models when training data are scarce?
- RQ2: How does interpretability of symbolic regression compare with linear models and decision trees in small-data regimes?
Key findings
- Symbolic regression (QLattice) outperformed all other models in 132 of 240 cases under the best-configuration comparison.
- Across all 240 cases with every configuration in the comparison, QLattice with the BIC criterion ranked highest (first places: 77; weighted score: 644).
- When restricting to the best configuration of each of the five technologies, QLattice (BIC) led with 132 first places and the highest weighted score (1033).
- The second-best algorithm was a random forest, which performed best in 37 of the 240 cases; both ensemble methods (random forests and gradient boosting) generally lagged behind symbolic regression on out-of-sample generalisation.
- Among interpretable models, symbolic regression was the best in 184 of 240 cases (vs. 49 for Lasso and 7 for simple decision trees).
- Simple models (e.g., decision trees) tended to generalise better than ensembles on these small datasets, with symbolic regression striking a balance between learning and generalisation.
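To put the reported counts in proportion, the short sketch below converts them to shares of the 240 cases. The counts are the ones quoted in this review; the dictionary and function names are assumptions made for the example.

```python
TOTAL_CASES = 48 * 5  # 48 data sets, 5 training samples each

# First-place counts as reported in this review.
overall_firsts = {"symbolic regression": 132, "random forest": 37}
interpretable_firsts = {"symbolic regression": 184, "Lasso": 49, "decision tree": 7}


def shares(counts, total=TOTAL_CASES):
    """Convert first-place counts to percentages of all cases, rounded to 0.1."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


overall_shares = shares(overall_firsts)
interpretable_shares = shares(interpretable_firsts)

# The interpretable-model counts cover every case exactly once.
covers_all_cases = sum(interpretable_firsts.values()) == TOTAL_CASES
```

Symbolic regression wins in 55.0% of cases overall and in 76.7% when only interpretable models compete, against 15.4% for the random forest in the overall comparison.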
This review was created by AI and reviewed by human editors.