QUICK REVIEW

[Paper Review] NeuMiss networks: differentiable programming for supervised learning with missing values

Marine Le Morvan, Julie Josse | arXiv (Cornell University) | Jul 3, 2020
Machine Learning and ELM | 32 references | 14 citations
TL;DR

NeuMiss networks are a differentiable neural network architecture that handles missing values through a new multiplicative nonlinearity: element-wise multiplication by the missingness indicator. By approximating the Bayes predictor (derived under a linearity assumption) with a Neumann series, the method reaches good predictive accuracy on supervised learning with missing values, including under MNAR mechanisms such as self-masking, while keeping the number of parameters and the computational complexity independent of the number of missing-data patterns.

ABSTRACT

The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann-series approximation of the optimal predictor, we propose a new principled architecture, named NeuMiss networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of NeuMiss networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.

Motivation & Objective

  • Address the challenge of supervised learning with missing values, particularly under complex missing data mechanisms like MNAR.
  • Overcome the exponential computational and sample complexity of traditional methods that model all 2^d missing patterns explicitly.
  • Develop a theoretically grounded neural network architecture that implicitly learns to impute values based on observed data and missingness patterns.
  • Ensure robustness to unknown or complex missing data mechanisms, including self-masking MNAR, where standard imputation or EM fails.
  • Achieve high predictive accuracy with low sample and computational complexity, scalable to high-dimensional data.

Proposed method

  • Derive the analytical form of the Bayes predictor for linear regression under MAR and MNAR mechanisms, including self-masking.
  • Approximate the optimal predictor using a Neumann series expansion, enabling differentiable optimization.
  • Introduce a novel non-linearity: element-wise multiplication of hidden representations by the missingness indicator (⊙M), enabling pattern-aware learning.
  • Design a deep architecture where each layer applies the ⊙M non-linearity, allowing the network to learn complex, data-dependent imputations.
  • Train the network end to end via stochastic gradient descent with a standard loss (e.g., MSE); every operation, including the ⊙M multiplication, is differentiable.
  • Use residual connections in deeper variants to stabilize training and improve generalization.
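
The first two steps above can be written out explicitly. The formulas below are a sketch under assumptions this review does not state: Gaussian covariates X ~ N(μ, Σ), a linear response Y = β₀ + ⟨β, X⟩ + ε, and an MCAR or MAR mechanism. The subscripts obs and mis (the observed and missing coordinates for a given pattern M) are notation introduced here for illustration.

```latex
% Bayes predictor for a given missing pattern M (Gaussian X, MCAR/MAR assumed)
f^\star(X_{obs}, M)
  = \beta_0 + \langle \beta_{obs}, X_{obs} \rangle
  + \big\langle \beta_{mis},\;
      \mu_{mis} + \Sigma_{mis,obs}\, \Sigma_{obs}^{-1} (X_{obs} - \mu_{obs})
    \big\rangle

% Neumann series for the inverse, valid when the spectral radius of
% (Id - \Sigma_{obs}) is strictly less than 1
\Sigma_{obs}^{-1} = \sum_{k=0}^{\infty} (\mathrm{Id} - \Sigma_{obs})^{k}

% Truncation at order \ell, written as a recursion (one step per layer)
S^{(\ell)} = (\mathrm{Id} - \Sigma_{obs})\, S^{(\ell-1)} + \mathrm{Id},
\qquad S^{(0)} = \mathrm{Id}
```

The recursion is what makes a layered architecture natural: each additional step of the truncated series corresponds to one more layer, and the dependence on the pattern M enters only through which coordinates are kept.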

Experimental results

Research questions

  • RQ1: What is the analytical form of the optimal predictor for linear regression when data are missing under MAR and MNAR mechanisms?
  • RQ2: Can a neural network architecture be designed to implicitly learn the optimal imputation function without explicitly modeling all 2^d missing patterns?
  • RQ3: How does the use of ⊙M nonlinearities (multiplication by the missingness indicator) improve generalization and robustness to missing data mechanisms?
  • RQ4: Does the NeuMiss architecture achieve better predictive performance than standard methods like EM or MICE, especially under MNAR settings?
  • RQ5: What is the theoretical and empirical sample complexity of NeuMiss networks compared to methods that require 2^d models?
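
To make the mechanisms named in these questions concrete, here is a minimal sketch that draws missingness masks under MCAR and under a self-masking MNAR mechanism, where the probability that an entry is missing depends on that entry's own value. The function names and the logistic form of the self-masking probability are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def mcar_mask(x, rate, rng):
    """MCAR: every entry is missing with the same probability `rate`,
    independently of the data. True means 'missing'."""
    return rng.random(x.shape) < rate

def self_masking_mask(x, slope, rng):
    """Self-masking MNAR (illustrative logistic form, an assumption):
    entry x[i, j] is missing with probability sigmoid(slope * x[i, j]),
    so larger values are more likely to be hidden."""
    p_missing = 1.0 / (1.0 + np.exp(-slope * x))
    return rng.random(x.shape) < p_missing

rng = np.random.default_rng(0)
x = rng.normal(size=(20000, 3))          # standard normal features
m_mcar = mcar_mask(x, 0.5, rng)
m_self = self_masking_mask(x, 2.0, rng)

# Mean of the entries that remain observed under each mechanism.
mean_obs_mcar = x[~m_mcar].mean()
mean_obs_self = x[~m_self].mean()
```

Under MCAR the observed entries remain representative of the full distribution, whereas under self-masking the observed mean is shifted away from the true mean of zero. That shift is one way to see why methods that assume the mask is ignorable can be misspecified in this setting.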

Key findings

  • NeuMiss networks achieve near-optimal performance, with R² scores within 1% of the Bayes rate (d = 10, n = 10^5) under MCAR and MAR.
  • In self-masking MNAR settings, NeuMiss significantly outperforms EM and MICE, with performance gaps widening as sample size increases.
  • The architecture maintains low computational complexity O(d²) and sample complexity O(d²), independent of the number of missing patterns 2^d.
  • NeuMiss networks are robust to MNAR mechanisms, including self-masking, where EM and imputation-based methods fail due to model misspecification.
  • Increasing the capacity of NeuMiss networks improves prediction accuracy, unlike classical MLPs where deeper networks do not yield gains.
  • A shallow version of NeuMiss is mathematically equivalent to a standard MLP with masked input, providing theoretical justification for the common practice of concatenating the mask.
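
The claim that cost does not grow with the number of patterns can be illustrated with a small NumPy sketch. It assumes Gaussian covariates with known mean `mu` and covariance `sigma` (eigenvalues in (0, 1)) and a known linear model, none of which is given in this review, and it plugs in the true parameters rather than learning them, so it is not the trained network. The names `masked_neumann_predict` and `exact_predict` are hypothetical. In this sketch the mask multiplication uses the indicator of observed entries, which zeroes out the missing coordinates.

```python
import numpy as np

def masked_neumann_predict(x, observed, beta0, beta, mu, sigma, depth):
    """Approximate the conditional-Gaussian predictor with a truncated
    Neumann iteration. One d x d matrix (Id - sigma) is reused for every
    missing pattern; the pattern enters only through the mask."""
    obs = observed.astype(float)
    mis = 1.0 - obs
    x0 = np.where(observed, x, 0.0)       # replace NaNs by 0 before any arithmetic
    centered = obs * (x0 - mu)            # (x_obs - mu_obs), zero on missing coords
    w = np.eye(len(mu)) - sigma
    h = centered.copy()
    for _ in range(depth):
        # On observed coords: h <- (Id - sigma_obs) h + (x_obs - mu_obs)
        h = obs * (w @ h) + centered
    # h now approximates sigma_obs^{-1} (x_obs - mu_obs); impute missing coords.
    filled = x0 + mis * (mu + sigma @ h)
    return beta0 + beta @ filled

def exact_predict(x, observed, beta0, beta, mu, sigma):
    """Reference: solve the linear system for this specific pattern."""
    o, m = observed, ~observed
    rhs = x[o] - mu[o]
    cond_mean = mu[m] + sigma[np.ix_(m, o)] @ np.linalg.solve(sigma[np.ix_(o, o)], rhs)
    return beta0 + beta[o] @ x[o] + beta[m] @ cond_mean

rng = np.random.default_rng(0)
d = 6
q, _ = np.linalg.qr(rng.normal(size=(d, d)))
sigma = q @ np.diag(np.linspace(0.3, 0.9, d)) @ q.T   # eigenvalues in (0, 1)
mu, beta = rng.normal(size=d), rng.normal(size=d)
observed = np.array([True, False, True, True, False, True])
x = np.where(observed, rng.normal(size=d), np.nan)

approx = masked_neumann_predict(x, observed, 0.5, beta, mu, sigma, depth=100)
exact = exact_predict(x, observed, 0.5, beta, mu, sigma)
```

With these eigenvalues the iteration contracts by a factor of at most 0.7 per step, so `approx` and `exact` should agree closely at depth 100. Each step costs one d × d matrix-vector product regardless of which of the 2^d patterns is presented, whereas the reference solves a separate linear system per pattern.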

This review was created by AI and reviewed by human editors.