QUICK REVIEW

[Paper Review] Polygenic Modeling with Bayesian Sparse Linear Mixed Models

Xiang Zhou, Peter Carbonetto | arXiv (Cornell University) | Sep 6, 2012
Genetic and phenotypic traits in livestock | 30 citations
TL;DR

This paper introduces a Bayesian sparse linear mixed model (BSLMM) that includes both linear mixed models (LMMs) and sparse regression as special cases, so the model can adapt to the genetic architecture of the trait at hand. With hyperparameters estimated from the data and a novel MCMC algorithm for posterior inference, BSLMM improves phenotype prediction accuracy and gives robust estimates of the proportion of variance explained (PVE, or chip heritability) across diverse genetic architectures.

ABSTRACT

Both linear mixed models (LMMs) and sparse regression models are widely used in genetics applications, including, recently, polygenic modeling in genome-wide association studies. These two approaches make very different assumptions, so are expected to perform well in different situations. However, in practice, for a given data set one typically does not know which assumptions will be more accurate. Motivated by this, we consider a hybrid of the two, which we refer to as a "Bayesian sparse linear mixed model" (BSLMM) that includes both these models as special cases. We address several key computational and statistical issues that arise when applying BSLMM, including appropriate prior specification for the hyper-parameters, and a novel Markov chain Monte Carlo algorithm for posterior inference. We apply BSLMM and compare it with other methods for two polygenic modeling applications: estimating the proportion of variance in phenotypes explained (PVE) by available genotypes, and phenotype (or breeding value) prediction. For PVE estimation, we demonstrate that BSLMM combines the advantages of both standard LMMs and sparse regression modeling. For phenotype prediction it considerably outperforms either of the other two methods, as well as several other large-scale regression methods previously suggested for this problem. Software implementing our method is freely available from http://stephenslab.uchicago.edu/software.html

Motivation & Objective

  • To address the challenge of choosing between LMMs and sparse regression models in polygenic modeling when the true genetic architecture is unknown.
  • To develop a unified model that combines the strengths of both LMMs (for polygenic architectures) and sparse regression (for few causal variants).
  • To ensure reliable inference by deriving appropriate prior distributions for hyperparameters and estimating them from data.
  • To design an efficient MCMC algorithm that avoids ad hoc approximations and scales to large datasets with thousands of individuals and hundreds of thousands of SNPs.
  • To evaluate BSLMM's performance in two key applications: estimating proportion of variance explained (PVE) and predicting phenotypes.

Proposed method

  • Proposes a Bayesian sparse linear mixed model (BSLMM) that includes both LMM and Bayesian variable selection regression (BVSR) as special cases.
  • Uses a hierarchical prior structure with mixture priors on SNP effect sizes to allow for both small, polygenic effects and a few large effects.
  • Employs a novel MCMC algorithm leveraging a recent linear algebra trick to efficiently compute high-dimensional Gaussian integrals in LMMs.
  • Estimates hyperparameters (e.g., variance components, sparsity parameters) from the data using non-informative or weakly informative priors to ensure adaptivity.
  • Applies the model to both simulated data and real datasets (WTCCC, heterogeneous stock mice) for comparative evaluation.
  • Uses predictive performance metrics such as RMSE, correlation, AUC, and Brier score to benchmark against LMM, BVSR, and other large-scale regression methods.
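The model described in the bullets above can be written compactly. The following is a sketch in notation assumed for illustration (this review does not give the equations): n individuals, genotype matrix X, relatedness matrix K, residual precision τ, sparsity parameter π, and variance scalings σ_a² and σ_b² for the sparse and polygenic components.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{1}_n \mu + \mathbf{X}\tilde{\boldsymbol{\beta}} + \mathbf{u} + \boldsymbol{\epsilon}, \\
\tilde{\beta}_i &\sim \pi \, \mathrm{N}\!\left(0, \sigma_a^2 \tau^{-1}\right) + (1 - \pi)\, \delta_0, \\
\mathbf{u} &\sim \mathrm{MVN}_n\!\left(\mathbf{0}, \sigma_b^2 \tau^{-1} \mathbf{K}\right), \\
\boldsymbol{\epsilon} &\sim \mathrm{MVN}_n\!\left(\mathbf{0}, \tau^{-1} \mathbf{I}_n\right).
\end{aligned}
```

Under this parameterization, setting σ_a = 0 (or π = 0) removes the sparse term and leaves a standard LMM, while setting σ_b = 0 removes the random effect u and leaves BVSR. This is the sense in which both models are special cases.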

Experimental results

Research questions

  • RQ1: Can a unified model that combines LMMs and sparse regression outperform each individual method in estimating the proportion of variance in phenotypes explained by genotypes?
  • RQ2: Does the BSLMM framework adaptively learn the underlying genetic architecture (e.g., number and size of causal variants) from data?
  • RQ3: How does BSLMM perform in phenotype prediction compared to LMM, BVSR, and other large-scale regression methods across diverse genetic architectures?
  • RQ4: Can the proposed MCMC algorithm efficiently handle large-scale genetic data with thousands of individuals and hundreds of thousands of SNPs?
  • RQ5: Does data-driven estimation of hyperparameters lead to more robust and accurate inference than fixed hyperparameter values?

Key findings

  • BSLMM significantly outperforms both LMM and BVSR in phenotype prediction, with a mean Relative Predictive Gain (RPG) of 1.24 in simulation scenarios with medium/small-effect SNPs.
  • In the WTCCC data set, BSLMM achieved AUC values of 0.60–0.88 across seven diseases, with the highest AUC of 0.88 for type 1 diabetes, outperforming LMM and BVSR.
  • For the heterogeneous stock mouse data set, BSLMM achieved a mean RMSE of 0.70–0.99 across six data splits, with performance consistently better than LMM and BVSR.
  • In PVE estimation, BSLMM provided more accurate and stable estimates than LMM or BVSR, especially when the true genetic architecture was neither purely polygenic nor sparse.
  • The BSLMM model achieved a Brier score of 0.139 ± 0.006 for type 1 diabetes, significantly lower than other models, indicating superior performance in binary trait prediction.
  • The novel MCMC algorithm enabled reliable inference on large-scale data, avoiding the ad hoc approximations common in previous implementations of similar models.
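The prediction metrics quoted in these findings (RMSE, correlation, AUC, Brier score) can be computed as in the minimal NumPy sketch below. The function names are illustrative assumptions, not taken from the paper or its software, and the AUC here is the plain pairwise (Mann-Whitney) definition rather than any specific implementation used by the authors.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted phenotypes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted phenotypes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def brier_score(labels, probs):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    labels = np.asarray(labels, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs - labels) ** 2))


def auc(labels, scores):
    """Probability that a random case scores above a random control.

    Ties count as one half. Builds an (n_cases x n_controls) matrix,
    so this is only suitable for modest sample sizes.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    cases, controls = scores[labels], scores[~labels]
    diff = cases[:, None] - controls[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


if __name__ == "__main__":
    # Toy binary-trait example: two controls (0) and two cases (1).
    labels = [0, 0, 1, 1]
    probs = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", auc(labels, probs))
    print("Brier:", brier_score(labels, probs))
```

Lower is better for RMSE and the Brier score; higher is better for correlation and AUC. An AUC of 0.5 corresponds to random ranking of cases and controls.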

This review was created by AI and reviewed by human editors.