[Paper Review] Regularization Learning Networks: Deep Learning for Tabular Datasets
This paper introduces Regularization Learning Networks (RLNs), a deep learning framework that assigns individual regularization coefficients to each weight in a neural network, enabling improved performance on tabular datasets where feature importance varies widely. By optimizing these coefficients via a novel Counterfactual Loss during training—without requiring a validation set—RLNs achieve performance comparable to Gradient Boosting Trees, produce highly sparse and interpretable models, and significantly outperform standard DNNs on tabular data.
Despite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this will lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme which minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets, and achieve comparable results to GBTs, with the best performance achieved with an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records. An open source implementation of RLN can be found at https://github.com/irashavitt/regularization_learning_networks.
Motivation & Objective
- To address the underperformance of Deep Neural Networks (DNNs) on tabular datasets compared to Gradient Boosting Trees (GBTs), particularly due to high variability in input feature importance.
- To investigate whether assigning a unique regularization coefficient to each weight can improve DNN performance on non-distributed representations like those in tabular data.
- To develop an efficient hyperparameter tuning method that avoids the intractable complexity of tuning millions of individual regularization coefficients.
- To enable joint learning on mixed-data tasks, such as combining tabular electronic health records with unstructured data like medical images.
- To produce sparse, interpretable models that reveal meaningful feature importance and support feature selection.
Proposed method
- Introduce a new loss function called the Counterfactual Loss ($\mathcal{L}_{CF}$) to jointly optimize regularization coefficients and network weights during training.
- Optimize regularization coefficients in log space and apply a projection after each update to prevent the coefficients from vanishing.
- Eliminate the need for a separate validation set by using the Counterfactual Loss to guide hyperparameter tuning directly during backpropagation.
- Assign a unique regularization coefficient to every weight in the network, enabling modular regularization that adapts to feature importance variability.
- Train the network end-to-end with both weights and regularization coefficients updated simultaneously using gradient-based optimization.
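- Obtain sparsity as a by-product of the learned regularization during training rather than through a separate post-training pruning step, yielding networks that eliminate up to 99.8% of edges and 82% of input features and are therefore easier to interpret.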
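The steps listed above can be illustrated with a minimal sketch of one training step. It is not the authors' implementation: it assumes a linear model with squared error, an L1 penalty whose per-weight coefficient is exp(λ_i), and a projection that shifts all log-coefficients so their mean stays at a fixed value. The names `grad_mse`, `rln_step`, `coef_lr`, and `avg_log_coef` are hypothetical and chosen only for this example.

```python
import math


def grad_mse(weights, batch):
    """Gradient of the mean squared error of a linear model y ~ w . x."""
    grads = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grads[i] += 2.0 * err * xi / len(batch)
    return grads


def sign(v):
    return (v > 0) - (v < 0)


def rln_step(weights, log_coefs, batch, next_batch,
             lr=0.05, coef_lr=1.0, avg_log_coef=-4.0):
    """One joint update of weights and per-weight log-coefficients (assumed form)."""
    # 1. Weight step: loss gradient plus the gradient of the L1 penalty,
    #    where weight i has its own coefficient exp(log_coefs[i]).
    g = grad_mse(weights, batch)
    r = [math.exp(l) * sign(w) for l, w in zip(log_coefs, weights)]
    new_w = [w - lr * (gi + ri) for w, gi, ri in zip(weights, g, r)]

    # 2. Counterfactual step: evaluate the loss gradient on the next batch at
    #    the updated weights. By the chain rule,
    #    dL/dlog_coef_i = g_next_i * d(new_w_i)/dlog_coef_i = g_next_i * (-lr * r_i),
    #    so gradient descent on the log-coefficient adds coef_lr * lr * g_next_i * r_i.
    g_next = grad_mse(new_w, next_batch)
    new_l = [l + coef_lr * lr * gn * ri
             for l, gn, ri in zip(log_coefs, g_next, r)]

    # 3. Projection: shift all log-coefficients so their mean stays fixed,
    #    which keeps them from collectively drifting toward minus infinity.
    shift = avg_log_coef - sum(new_l) / len(new_l)
    new_l = [l + shift for l in new_l]
    return new_w, new_l


# Toy data: the target depends only on feature 0; feature 1 is irrelevant.
batch = [((1.0, 1.0), 2.0), ((1.0, -1.0), 2.0),
         ((-1.0, 1.0), -2.0), ((-1.0, -1.0), -2.0)]
weights, log_coefs = [0.1, 0.1], [-4.0, -4.0]
for _ in range(20):
    weights, log_coefs = rln_step(weights, log_coefs, batch, batch)
print(weights, log_coefs)
```

In this toy setup a single step from equal coefficients already lowers the log-coefficient of the relevant feature and raises that of the irrelevant one, which is the qualitative behavior the method relies on. No validation set is involved, since the coefficient update uses only the training loss on the following batch.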
Experimental results
Research questions
- RQ1: Can assigning individual regularization coefficients to each weight improve DNN performance on tabular datasets with highly variable input feature importance?
- RQ2: Is it possible to efficiently optimize millions of regularization coefficients without relying on a validation set or derivative-free hyperparameter tuning?
- RQ3: How does the Counterfactual Loss enable effective joint optimization of weights and regularization coefficients in deep networks?
- RQ4: To what extent do RLNs produce sparse, interpretable models that reflect true feature importance in tabular data?
- RQ5: Can RLNs be effectively combined with GBTs in an ensemble to achieve state-of-the-art performance on tabular prediction tasks?
Key findings
- RLNs significantly improve DNN performance on tabular datasets, increasing explained variance by a factor of 2.75±0.05 compared to standard DNNs.
- RLNs achieve performance comparable to Gradient Boosting Trees (GBTs), particularly excelling in settings with high variability in input feature importance.
- The ensemble of RLNs and GBTs outperforms all other ensembles on 3 out of 4 traits and achieves state-of-the-art results on all but one trait in the microbiome prediction task.
- RLNs produce extremely sparse networks, eliminating up to 99.8% of network edges and 82% of input features, with sparsity achieved within the first 10–20 training epochs.
- Feature importance derived from RLNs is more consistent across model instantiations: its Jensen-Shannon divergence is 48%±1% lower than for DNNs and 54%±2% lower than for linear models (LMs), which also supports interpretability.
- The entropy of feature importance in RLNs is 4.6 bits, compared to 9.5 bits in DNNs, indicating more meaningful and non-uniform feature importance distribution.
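The two statistics in the last findings, entropy and Jensen-Shannon divergence of a feature-importance distribution, can be computed as in the sketch below. It assumes feature importance is given as a vector of non-negative scores that is normalized to sum to 1. How the paper derives those scores is not stated in this review, and the helper names are hypothetical.

```python
import math


def normalize(importances):
    """Turn non-negative importance scores into a probability distribution."""
    total = sum(importances)
    return [v / total for v in importances]


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits: mean KL of p and q to their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl_bits(p, m) + kl_bits(q, m)) / 2


# Importance spread evenly over 8 features versus concentrated on one of them.
uniform = normalize([1.0] * 8)
peaked = normalize([93.0] + [1.0] * 7)
print(entropy_bits(uniform), entropy_bits(peaked))
print(js_divergence_bits(uniform, peaked))
```

One way to read the reported entropies: a distribution with entropy H bits is as spread out as a uniform distribution over 2^H items, so 4.6 bits corresponds to roughly 24 equally important features and 9.5 bits to roughly 724.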
This review was created by AI and reviewed by human editors.