QUICK REVIEW

[Paper Review] Efficient EM Training of Gaussian Mixtures with Missing Data

Olivier Delalleau, Aaron Courville | arXiv (Cornell University) | Sep 4, 2012
Bayesian Methods and Mixture Models | 10 references | 20 citations
TL;DR

This paper proposes a spanning-tree-based algorithm that accelerates EM training of Gaussian Mixture Models (GMMs) on data with missing values. By applying matrix updates along a minimum spanning tree of missing-value patterns, the method achieves up to an order-of-magnitude speedup over standard EM. The trained GMM then imputes missing values via conditional expectations, which outperforms global-mean and nearest-neighbor imputation when the imputed data are fed to discriminative models.

ABSTRACT

In data-mining applications, we are frequently faced with a large fraction of missing entries in the data matrix, which is problematic for most discriminant machine learning algorithms. A solution that we explore in this paper is the use of a generative model (a mixture of Gaussians) to compute the conditional expectation of the missing variables given the observed variables. Since training a Gaussian mixture with many different patterns of missing values can be computationally very expensive, we introduce a spanning-tree based algorithm that significantly speeds up training in these conditions. We also observe that good results can be obtained by using the generative model to fill-in the missing values for a separate discriminant learning algorithm.

Motivation & Objective

  • To address the high computational cost of standard EM training for Gaussian Mixture Models when dealing with missing data in high-dimensional datasets.
  • To develop a scalable and efficient training algorithm that reduces the time complexity of EM updates under diverse missing data patterns.
  • To evaluate the effectiveness of using conditional expectation imputation from a trained GMM as a preprocessing step for discriminative models.
  • To demonstrate that a generative model of the data distribution, used to fill in missing entries, improves the performance of downstream discriminative learning algorithms.

Proposed method

  • Proposes a spanning-tree-based algorithm to organize and group missing data patterns, enabling efficient matrix computations during EM training.
  • Uses matrix updates over the spanning tree to compute conditional expectations and update parameters without inverting large covariance matrices for each unique missing pattern.
  • Applies the EM algorithm to learn a mixture of Gaussians with full covariance matrices, assuming missing data are Missing At Random (MAR).
  • Imputes missing values using the conditional expectation $ \mathbb{E}[x_m \mid x_o] $, derived analytically from the learned GMM.
  • Trains the GMM on the full data matrix with missing entries, using iterative E-step and M-step updates with optimized matrix operations.
  • Combines the GMM imputation with discriminative models (neural networks and kernel ridge regression) to improve prediction performance.
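
The imputation step in the list above can be sketched in a few lines. For a single Gaussian component $ k $, the conditional mean of the missing block is $ \mu_{k,m} + \Sigma_{k,mo} \Sigma_{k,oo}^{-1} (x_o - \mu_{k,o}) $, and for a mixture these per-component means are averaged with the posterior weights $ p(k \mid x_o) $. The function below is a minimal sketch of that standard formula, not the paper's optimized implementation: the name `gmm_conditional_mean`, the NaN encoding of missing entries, and the argument layout are all assumptions made for illustration, and it solves one linear system per component instead of reusing work across patterns.

```python
import numpy as np

def gmm_conditional_mean(x, weights, means, covs):
    """Fill NaN entries of x with E[x_m | x_o] under a full-covariance GMM.

    Hypothetical helper for illustration. Assumes missing entries are NaN,
    at least one entry is observed, means has shape (K, d) and covs (K, d, d).
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    m = np.isnan(x)   # mask of missing coordinates
    o = ~m            # mask of observed coordinates
    if not m.any():
        return x.copy()

    K = len(weights)
    log_resp = np.empty(K)
    cond_means = np.empty((K, int(m.sum())))
    for k in range(K):
        S_oo = covs[k][np.ix_(o, o)]   # observed-observed covariance block
        S_mo = covs[k][np.ix_(m, o)]   # missing-observed covariance block
        diff = x[o] - means[k][o]
        sol = np.linalg.solve(S_oo, diff)          # S_oo^{-1} (x_o - mu_o)
        _, logdet = np.linalg.slogdet(S_oo)
        # log of weight_k * N(x_o; mu_o, S_oo)
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            logdet + diff @ sol + o.sum() * np.log(2.0 * np.pi)
        )
        cond_means[k] = means[k][m] + S_mo @ sol   # per-component E[x_m | x_o]

    # Normalize posteriors p(k | x_o) in a numerically stable way.
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()

    out = x.copy()
    out[m] = resp @ cond_means
    return out
```

With a single component of zero mean and covariance `[[1, 0.5], [0.5, 1]]`, calling the function on `[2.0, nan]` fills the second entry with `0.5 * 2.0`. The imputed vectors can then be passed to any discriminative learner, which is the combination described in the last bullet.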

Experimental results

Research questions

  • RQ1: Can EM training of Gaussian Mixture Models be made computationally feasible for high-dimensional datasets with diverse missing data patterns?
  • RQ2: Does using conditional expectation imputation from a GMM improve the performance of downstream discriminative models compared to simple imputation methods?
  • RQ3: Can a spanning-tree structure over missing patterns reduce the computational cost of EM updates without sacrificing model accuracy?
  • RQ4: How does the performance of GMM-based imputation compare to global mean and nearest-neighbor imputation in terms of prediction error?
  • RQ5: Does combining generative imputation with discriminative learning yield better results than using the GMM directly as a predictor?

Key findings

  • The proposed spanning-tree-based algorithm reduces EM training time by up to an order of magnitude compared to standard EM on datasets with many missing patterns.
  • Conditional expectation imputation from a trained GMM significantly outperforms global mean and nearest-neighbor imputation in terms of test mean-squared error on the Abalone dataset.
  • When combined with discriminative models like neural networks and kernel ridge regression, GMM-based imputation leads to lower test error than using the GMM alone as a regressor.
  • The advantage of GMM imputation grows as the proportion of missing values increases, because nearest-neighbor methods degrade when few nearby complete samples are available.
  • The method remains effective even in high-dimensional settings, where standard EM becomes computationally prohibitive due to the exponential number of possible missing patterns.
  • The results validate that generative models trained on the full data distribution can provide useful inductive bias for discriminative learning, especially when data is incomplete.
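
To make the spanning-tree idea behind the speedup concrete, the sketch below collects the distinct missing-value patterns in a dataset and links them with a minimum spanning tree, so that each pattern is reached from a neighbor that differs in as few variables as possible. This is only an illustration under stated assumptions: the review does not give the paper's edge cost, so the Hamming distance between patterns is used here as a stand-in, and the function name `pattern_spanning_tree` and its return format are hypothetical. The tree is built with a plain Prim's algorithm.

```python
import numpy as np

def pattern_spanning_tree(mask):
    """Build a minimum spanning tree over the unique missing patterns.

    mask: (n_samples, n_features) boolean array, True where a value is missing.
    Returns (patterns, edges), where edges is a list of (parent, child, cost)
    tuples indexing into patterns. Edge cost is the Hamming distance, which is
    an assumption made for this sketch.
    """
    mask = np.asarray(mask, dtype=bool)
    # Distinct patterns, sorted lexicographically by np.unique.
    patterns = np.unique(mask.astype(np.uint8), axis=0).astype(bool)
    p = len(patterns)

    # Pairwise Hamming distances: number of variables whose status differs.
    dist = (patterns[:, None, :] != patterns[None, :, :]).sum(axis=2)

    # Prim's algorithm, rooted at pattern 0.
    in_tree = np.zeros(p, dtype=bool)
    in_tree[0] = True
    best_cost = dist[0].astype(float)       # cheapest link from the tree to each node
    best_parent = np.zeros(p, dtype=int)    # tree node providing that link
    edges = []
    for _ in range(p - 1):
        candidates = np.where(in_tree, np.inf, best_cost)
        j = int(np.argmin(candidates))
        edges.append((int(best_parent[j]), j, int(best_cost[j])))
        in_tree[j] = True
        closer = dist[j] < best_cost
        best_cost = np.where(closer, dist[j], best_cost)
        best_parent = np.where(closer, j, best_parent)
    return patterns, edges
```

Walking the returned edges from the root visits every pattern once, and the cost on each edge is the number of variables that enter or leave the observed set along that step. A small cost is the situation in which updating a matrix factorization from the parent pattern is cheaper than recomputing it from scratch.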

This review was created by AI and reviewed by human editors.