[Paper Review] Stochastic Approximation EM for Logistic Regression with Missing Values
This paper proposes a stochastic approximation EM algorithm using Metropolis-Hastings sampling for logistic regression with missing data, enabling parameter estimation, variance inference, confidence intervals, model selection, and prediction on incomplete test sets. The method is computationally efficient and demonstrates strong coverage and variable selection performance in simulations and a real-world trauma dataset.
Logistic regression is a common classification method in supervised learning. Surprisingly, there are very few solutions for performing it and selecting variables in the presence of missing values. We propose a stochastic approximation version of the EM algorithm based on Metropolis-Hastings sampling to perform statistical inference for logistic regression with incomplete data. We propose a complete approach, including the estimation of parameters and their variance, derivation of confidence intervals, a model selection procedure, and a method for prediction on test sets with missing values. The method is computationally efficient, and its good coverage and variable selection properties are demonstrated in a simulation study. We then illustrate the method on a dataset of polytraumatized patients from Paris hospitals to predict the occurrence of hemorrhagic shock, a leading cause of early preventable death in severe trauma cases. The aim is to consolidate the current red flag procedure, a binary alert identifying patients with a high risk of severe hemorrhage. The methodology is implemented in the R package misaem.
Motivation & Objective
- To address the lack of robust methods for logistic regression with missing values in supervised learning.
- To develop a computationally efficient approach that supports full statistical inference, including parameter estimation and variance-covariance estimation.
- To enable model selection and prediction on test sets with missing data.
- To validate the method’s performance through simulation studies and real-world application in trauma patient outcomes.
- To implement the method in an accessible R package (misaem) for broader research use.
Proposed method
- A stochastic approximation version of the EM algorithm is used to iteratively estimate parameters in logistic regression with missing data.
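- Because the E-step expectation over the missing values has no closed form, it is replaced by a simulation step: Metropolis-Hastings sampling draws the missing values from their conditional distribution given the observed data and the current parameter estimates.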
- The method jointly estimates regression coefficients and their standard errors, enabling confidence interval construction.
- Model selection is performed using a BIC criterion based on the observed log-likelihood.
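- Prediction on test sets with missing values is obtained by averaging the predicted probability over draws of the missing covariates given the observed ones, rather than by plugging in a single imputed value.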
- The algorithm is implemented in the R package misaem for reproducible and scalable use.
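The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the misaem implementation: it assumes a single Gaussian covariate with values missing completely at random, a two-parameter logistic model, and a step size of 1 during a burn-in followed by 1/(k - burn_in). All names (`saem`, `fit_logistic`, `lik`, `sigmoid`) are hypothetical, and the Gaussian parameters are averaged directly for brevity.

```python
import math
import random


def sigmoid(t):
    # Logistic function, written to avoid overflow for large |t|.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def lik(yi, xi, b0, b1):
    # P(y = yi | x = xi) under the logistic model, floored to avoid division by zero.
    p = sigmoid(b0 + b1 * xi)
    return max(p if yi == 1 else 1.0 - p, 1e-300)


def fit_logistic(x, y, b0, b1, iters=8):
    # Newton-Raphson for the complete-data MLE of (b0, b1), warm-started.
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:  # ill-conditioned Hessian: stop rather than divide
            break
        b0, b1 = (b0 + (h11 * g0 - h01 * g1) / det,
                  b1 + (h00 * g1 - h01 * g0) / det)
    return b0, b1


def saem(x_obs, y, n_iter=100, burn_in=30, mh_steps=5, seed=0):
    # x_obs: list of floats, with None marking a missing covariate value.
    rng = random.Random(seed)
    miss = [i for i, v in enumerate(x_obs) if v is None]
    seen = [v for v in x_obs if v is not None]
    n = len(x_obs)
    mu = sum(seen) / len(seen)
    s2 = sum((v - mu) ** 2 for v in seen) / len(seen)
    x = [mu if v is None else v for v in x_obs]  # initial fill-in
    b0 = b1 = 0.0
    for k in range(1, n_iter + 1):
        # Simulation step: independence Metropolis-Hastings with proposal N(mu, s2).
        # The target is p(y_i | x) * N(x; mu, s2), so the acceptance ratio
        # reduces to a ratio of logistic likelihoods.
        sd = math.sqrt(s2)
        for i in miss:
            for _ in range(mh_steps):
                cand = rng.gauss(mu, sd)
                if rng.random() < lik(y[i], cand, b0, b1) / lik(y[i], x[i], b0, b1):
                    x[i] = cand
        # Maximisation on the completed data.
        mu_hat = sum(x) / n
        s2_hat = sum((v - mu_hat) ** 2 for v in x) / n
        b0_hat, b1_hat = fit_logistic(x, y, b0, b1)
        # Stochastic approximation: move each parameter toward its new estimate.
        gamma = 1.0 if k <= burn_in else 1.0 / (k - burn_in)
        mu += gamma * (mu_hat - mu)
        s2 += gamma * (s2_hat - s2)
        b0 += gamma * (b0_hat - b0)
        b1 += gamma * (b1_hat - b1)
    return {"b0": b0, "b1": b1, "mu": mu, "s2": s2}
```

Calling `saem(x_obs, y)` with `None` in place of each missing covariate returns the averaged estimates. With a step size of 1 the parameters simply track the latest completed-data fit; once the step size decays, the iterates become a running mean, which smooths out the Monte Carlo noise from the sampled missing values.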
Experimental results
Research questions
- RQ1: Can a stochastic approximation EM algorithm with Metropolis-Hastings sampling effectively handle missing data in logistic regression?
- RQ2: How does the proposed method perform in terms of parameter estimation accuracy and coverage of confidence intervals?
- RQ3: Can the method support reliable variable selection and prediction on test sets with missing values?
- RQ4: How does the method compare to existing approaches in terms of computational efficiency and statistical performance?
- RQ5: Does the method improve the identification of high-risk trauma patients for hemorrhagic shock in real-world clinical data?
Key findings
- The proposed method achieves good coverage rates for confidence intervals, even with moderate to high missing data rates.
- Variable selection performance was strong, correctly identifying relevant predictors in simulation studies.
- The method demonstrated computational efficiency, scaling well with sample size and missing data proportion.
- In the polytrauma dataset, the method improved the identification of patients at risk of hemorrhagic shock compared to the standard red flag procedure.
- The implementation in the R package misaem enables practical application across diverse research settings.
- The approach successfully supports full statistical inference, including p-values and model selection, in the presence of missing data.
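Coverage in the findings above refers to the share of simulated datasets whose confidence interval contains the true coefficient. A minimal sketch of that computation follows, assuming Wald-type intervals (estimate plus or minus 1.96 standard errors). The helper names and the toy Gaussian-mean replications are illustrative assumptions, not the paper's simulation design.

```python
import math
import random


def wald_ci(estimate, std_error, z=1.96):
    # Two-sided Wald interval; z = 1.96 gives a nominal 95% level.
    return estimate - z * std_error, estimate + z * std_error


def empirical_coverage(true_value, estimates, std_errors, z=1.96):
    # Fraction of replications whose interval contains the true value.
    hits = 0
    for est, se in zip(estimates, std_errors):
        lo, hi = wald_ci(est, se, z)
        hits += lo <= true_value <= hi
    return hits / len(estimates)


def toy_replications(true_mean=2.0, n=50, reps=2000, seed=0):
    # Toy stand-in for a simulation study: estimate a Gaussian mean repeatedly.
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        m = sum(sample) / n
        var = sum((v - m) ** 2 for v in sample) / (n - 1)
        estimates.append(m)
        std_errors.append(math.sqrt(var / n))
    return estimates, std_errors
```

An empirical coverage close to the nominal 95% indicates that the standard errors are well calibrated; a markedly lower value would mean the intervals are too narrow.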
This review was created by AI and reviewed by human editors.