QUICK REVIEW
[Paper Review] Haplotype-based variant detection from short-read sequencing
Erik Garrison, Gábor Marth|arXiv (Cornell University)|Jul 17, 2012
Gene expression and cancer classification|23 references|4,047 citations
TL;DR
The paper develops a Bayesian framework for detecting haplotypes from short-read sequencing and implements it in FreeBayes to handle multiallelic loci and non-uniform copy number.
ABSTRACT
The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.
Motivation & Objective
- Motivate haplotype-based variant detection as a way to use the short-range phasing information carried by sequencing traces.
- Generalize variant detection to multiallelic loci and non-uniform copy number across samples.
- Develop a Bayesian model to compute P(G1,...,Gn|R1,...,Rn) incorporating data likelihood and priors.
- Implement a haplotype-based detector (FreeBayes) and provide posterior quality metrics.
- Enable direct detection of longer haplotypes and improve genotyping accuracy via local imputation concepts.
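Written out, the posterior named in the objectives above follows from Bayes' rule. The form below is a sketch that assumes each sample's reads R_i depend only on that sample's genotype G_i, so the joint data likelihood factorizes across samples; the denominator sums over every candidate combination of genotypes.

```latex
P(G_1,\ldots,G_n \mid R_1,\ldots,R_n)
  = \frac{P(G_1,\ldots,G_n)\,\prod_{i=1}^{n} P(R_i \mid G_i)}
         {\sum_{\forall G_1,\ldots,G_n} P(G_1,\ldots,G_n)\,\prod_{i=1}^{n} P(R_i \mid G_i)}
```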
Proposed method
- Define n samples with copy number mi and M total copies, and K alleles with frequencies fi at a locus.
- Extend Bayes’ rule to P(G1,...,Gn|R1,...,Rn) with data likelihood P(Ri|Gi) and priors based on population allele frequencies using Ewens’ sampling formula.
- Compute P(Ri|Gi) by accounting for observed alleles from reads, using multinomial sampling adjusted for base qualities and mapping qualities.
- Decompose priors into P(G1,...,Gn|f1,...,fk) and P(f1,...,fk), adjusting for unphased genotypes and using the multinomial coefficient with allele frequencies.
- Apply Ewens’ sampling formula to approximate P(f1,...,fk) under a neutral mutation-drift model with parameter θ.
- Assemble haplotype observations within dynamically determined windows anchored by the reference sequence, then use gradient ascent to find the maximum a posteriori genotype assignment under P(G1,...,Gn|R1,...,Rn).
- Provide outputs including the locus polymorphism probability P(K>1|R1,...,Rn) and marginal genotype likelihoods P(Gj|R1,...,Rn).
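The prior decomposition described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the FreeBayes implementation: the function names are hypothetical, the genotype term is one reading of "multinomial coefficient with allele frequencies" (phasings consistent with each unphased genotype, divided by all arrangements of the pooled allele counts), and Ewens' sampling formula is applied to the partition of allele counts. The value of θ in the usage note is chosen purely for illustration.

```python
from collections import Counter
from math import factorial, prod


def multinomial_coefficient(counts):
    """Number of distinct orderings of a multiset with the given counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


def ewens_probability(allele_counts, theta):
    """Ewens' sampling formula for a sample whose alleles appear allele_counts times."""
    counts = [c for c in allele_counts if c > 0]
    total_copies = sum(counts)
    # a[j] = number of distinct alleles seen exactly j times in the sample
    a = Counter(counts)
    rising = prod(theta + i for i in range(total_copies))
    partition_term = prod(
        theta ** a_j / (j ** a_j * factorial(a_j)) for j, a_j in a.items()
    )
    return factorial(total_copies) / rising * partition_term


def genotype_sampling_probability(genotypes):
    """P(G_1..G_n | allele counts) for unphased genotypes (assumed form).

    Each genotype is a tuple of alleles; its length is that sample's copy number,
    so samples with different copy numbers can be mixed.
    """
    pooled = Counter(allele for g in genotypes for allele in g)
    per_sample = prod(multinomial_coefficient(Counter(g).values()) for g in genotypes)
    return per_sample / multinomial_coefficient(pooled.values())


def joint_prior(genotypes, theta):
    """Prior on a genotype combination: sampling term times allele-frequency term."""
    pooled = Counter(allele for g in genotypes for allele in g)
    return genotype_sampling_probability(genotypes) * ewens_probability(
        pooled.values(), theta
    )
```

For example, `joint_prior([("A", "A"), ("A", "C"), ("A", "A", "C")], theta=0.001)` evaluates the prior for two diploid samples and one triploid sample. Keeping the two factors as separate functions mirrors the split into P(G1,...,Gn|f1,...,fk) and P(f1,...,fk).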
Experimental results
Research questions
- RQ1: Can multiallelic loci and non-uniform copy number be modeled within a Bayesian haplotype framework for variant detection?
- RQ2: Does incorporating population-level priors and phasing information improve haplotype-based variant detection from short reads?
- RQ3: Can longer haplotypes be detected directly from short-read data by assembling local haplotype observations?
- RQ4: How effective is the method at distinguishing true haplotypes from sequencing errors using base and mapping qualities?
- RQ5: What are the quality outputs (polymorphism probability and marginal genotype likelihoods) produced by the method?
Key findings
- A Bayesian framework is developed to model multiallelic loci and non-uniform copy number for haplotype-based variant detection.
- The approach generalizes prior and likelihood computations to handle unphased genotypes and uses Ewens’ sampling formula to estimate allele frequency priors.
- A haplotype detector (FreeBayes) assembles haplotype observations in dynamic windows and uses gradient ascent to find a maximum a posteriori multi-sample genotype.
- The method yields a posterior probability of polymorphism at a locus, P(K>1|R1,...,Rn), and provides marginal genotype likelihoods for individuals.
- Incorporating local imputation-like refinement improves raw genotype quality over purely maximum-likelihood approaches.
- The framework enables direct detection of longer haplotypes from short-read sequencing data by modeling multiallelic haplotypes within a unified Bayesian context.
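The two quality outputs listed in the findings above are both marginalizations over the joint posterior. The sketch below shows that step, assuming the joint posterior is available as a small table of unnormalized weights. The function name `summarize_posterior` and the example weights are hypothetical and do not come from the paper, and a real caller would search the genotype space rather than enumerate it.

```python
from collections import defaultdict


def summarize_posterior(joint_weights):
    """Return (P(K>1 | reads), per-sample genotype marginals).

    joint_weights maps a tuple of per-sample genotypes (each genotype a tuple
    of alleles) to an unnormalized posterior weight.
    """
    total = sum(joint_weights.values())
    posterior = {combo: w / total for combo, w in joint_weights.items()}

    # A combination is polymorphic when more than one distinct allele appears in it.
    p_polymorphic = sum(
        p
        for combo, p in posterior.items()
        if len({allele for genotype in combo for allele in genotype}) > 1
    )

    # Marginal for sample j: sum the joint posterior over all other samples.
    n_samples = len(next(iter(posterior)))
    marginals = [defaultdict(float) for _ in range(n_samples)]
    for combo, p in posterior.items():
        for j, genotype in enumerate(combo):
            marginals[j][genotype] += p
    return p_polymorphic, [dict(m) for m in marginals]


# Hypothetical weights for two diploid samples at a locus with alleles A and C.
example_weights = {
    (("A", "A"), ("A", "A")): 6.0,
    (("A", "A"), ("A", "C")): 3.0,
    (("A", "C"), ("A", "C")): 1.0,
}
p_poly, per_sample = summarize_posterior(example_weights)
```

With these illustrative weights, only the all-A combination is monomorphic, so `p_poly` is the normalized weight of the other two combinations.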
This review was created by AI and reviewed by human editors.