QUICK REVIEW

[Paper Review] Haplotype-based variant detection from short-read sequencing

Erik Garrison, Gábor Marth|arXiv (Cornell University)|Jul 17, 2012
Gene expression and cancer classification|References: 23|Cited by: 4,047
One-sentence summary

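This paper develops a Bayesian framework for detecting haplotypes from short-read sequencing data and implements it in FreeBayes, handling multiallelic loci and non-uniform copy number.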

ABSTRACT

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.

Motivation and objectives

  • Motivate haplotype-based variant detection to utilize short-range phasing information from sequencing traces.
  • Generalize variant detection to multiallelic loci and non-uniform copy number across samples.
  • Develop a Bayesian model to compute P(G1,...,Gn|R1,...,Rn) incorporating data likelihood and priors.
  • Implement a haplotype-based detector (FreeBayes) and provide posterior quality metrics.
  • Enable direct detection of longer haplotypes and improve genotyping accuracy via local imputation concepts.
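
The posterior named in the objectives can be written out explicitly. The form below is a sketch that assumes each sample's reads depend only on that sample's genotype, so the joint data likelihood factorizes across samples. The sum in the denominator runs over all possible genotype assignments.

```latex
\[
P(G_1,\ldots,G_n \mid R_1,\ldots,R_n)
  = \frac{P(G_1,\ldots,G_n)\,\prod_{i=1}^{n} P(R_i \mid G_i)}
         {\sum_{\forall G_1,\ldots,G_n} P(G_1,\ldots,G_n)\,\prod_{i=1}^{n} P(R_i \mid G_i)}
\]
```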

Proposed method

  • Define n samples, where sample i has copy number mi and M is the total number of copies across all samples, and K alleles at a locus with frequencies f1,...,fK.
  • Extend Bayes’ rule to P(G1,...,Gn|R1,...,Rn) with data likelihood P(Ri|Gi) and priors based on population allele frequencies using Ewens’ sampling formula.
  • Compute P(Ri|Gi) by accounting for observed alleles from reads, using multinomial sampling adjusted for base qualities and mapping qualities.
  • Decompose the prior into P(G1,...,Gn|f1,...,fK) and P(f1,...,fK); the first term uses multinomial coefficients to count the phased orderings consistent with each unphased genotype given the allele frequencies.
  • Apply Ewens’ sampling formula to approximate P(f1,...,fK) under a neutral mutation-drift model with parameter θ.
  • Assemble haplotype observations within dynamically determined windows, anchored by reference sequence, and compute P(G1,...,Gn|R1,...,Rn) via gradient ascent to a maximum a posteriori solution.
  • Provide outputs including the locus polymorphism probability P(K>1|R1,...,Rn) and marginal genotype likelihoods P(Gj|R1,...,Rn).
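
The allele-frequency prior in the steps above can be illustrated with a short Python sketch of Ewens' sampling formula. This is a minimal illustration, not FreeBayes code: the function name `ewens_probability` and the value of θ used in the usage note are assumptions made for the example. The function takes the number of copies observed for each allele and returns the probability of that allele-count partition.

```python
from collections import Counter
from math import factorial, prod


def ewens_probability(allele_counts, theta):
    """Probability of an allele-count partition under Ewens' sampling formula.

    allele_counts: copies observed for each allele, e.g. [3, 1] for M = 4.
    theta: mutation-rate parameter of the neutral model (must be > 0).
    """
    counts = [c for c in allele_counts if c > 0]
    total_copies = sum(counts)  # M in the text
    # a[j] = number of distinct alleles that appear exactly j times
    a = Counter(counts)
    # Rising factorial theta * (theta + 1) * ... * (theta + M - 1)
    rising = prod(theta + i for i in range(total_copies))
    partition_term = prod(
        theta ** a_j / (j ** a_j * factorial(a_j)) for j, a_j in a.items()
    )
    return factorial(total_copies) / rising * partition_term
```

For example, `ewens_probability([4], 0.001)` gives the prior weight of a monomorphic locus among four copies, and `ewens_probability([3, 1], 0.001)` the weight of a single minor-allele copy. With a small θ the formula concentrates nearly all prior mass on the monomorphic configuration, so read evidence has to overcome this prior before a variant is called.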

Results

Research questions

  • RQ1: Can multiallelic loci and non-uniform copy number be modeled within a Bayesian haplotype framework for variant detection?
  • RQ2: Does incorporating population-level priors and phasing information improve haplotype-based variant detection from short reads?
  • RQ3: Can longer haplotypes be detected directly from short-read data by assembling local haplotype observations?
  • RQ4: How effective is the method at distinguishing true haplotypes from sequencing errors using base and mapping qualities?
  • RQ5: What are the quality outputs (polymorphism probability and marginal genotype likelihoods) produced by the method?

Key findings

  • A Bayesian framework is developed to model multiallelic loci and non-uniform copy number for haplotype-based variant detection.
  • The approach generalizes prior and likelihood computations to handle unphased genotypes and uses Ewens’ sampling formula to estimate allele frequency priors.
  • A haplotype detector (FreeBayes) assembles haplotype observations in dynamic windows and uses gradient ascent to find a maximum a posteriori multi-sample genotype.
  • The method yields a posterior probability of polymorphism at a locus, P(K>1|R1,...,Rn), and provides marginal genotype likelihoods for individuals.
  • Incorporating local imputation-like refinement improves raw genotype quality over purely maximum-likelihood approaches.
  • The framework enables direct detection of longer haplotypes from short-read sequencing data by modeling multiallelic haplotypes within a unified Bayesian context.
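
To make the polymorphism probability P(K>1|R1,...,Rn) concrete, here is a toy Python sketch that enumerates every genotype assignment at a biallelic locus and normalizes by brute force. It is not the FreeBayes algorithm, which searches the posterior rather than enumerating it. The allele labels, the uniform diploid copy number, the value of θ, and the simple per-read error model are all assumptions made for illustration, and the prior is used unnormalized because the final division by the sum of weights takes care of scaling.

```python
from collections import Counter
from itertools import combinations_with_replacement, product
from math import comb, factorial, prod

ALLELES = ("A", "B")  # hypothetical biallelic locus
PLOIDY = 2            # assumed uniform copy number for this toy
THETA = 0.001         # illustrative value of the Ewens parameter


def read_likelihood(reads, genotype):
    """P(R_i | G_i) for reads given as (observed_allele, error_probability).

    Simplified model: a read samples one genotype copy uniformly and reports
    the other allele with probability err.
    """
    total = 1.0
    for allele, err in reads:
        per_copy = [(1 - err) if g == allele else err for g in genotype]
        total *= sum(per_copy) / len(genotype)
    return total


def ewens(counts, theta):
    """Ewens' sampling formula for a list of per-allele copy counts."""
    counts = [c for c in counts if c > 0]
    m = sum(counts)
    a = Counter(counts)  # a[j] = number of alleles seen exactly j times
    rising = prod(theta + i for i in range(m))
    return factorial(m) / rising * prod(
        theta ** k / (j ** k * factorial(k)) for j, k in a.items()
    )


def prior(combo):
    """Unnormalized P(G_1..G_n): ordering term times Ewens term."""
    total = Counter(allele for g in combo for allele in g)
    counts = [total[allele] for allele in ALLELES]
    orderings_total = comb(sum(counts), counts[0])
    orderings_samples = prod(comb(len(g), g.count("A")) for g in combo)
    return orderings_samples / orderings_total * ewens(counts, THETA)


def polymorphism_probability(reads_per_sample):
    """Posterior probability that more than one allele is present."""
    genotypes = list(combinations_with_replacement(ALLELES, PLOIDY))
    weights = {}
    for combo in product(genotypes, repeat=len(reads_per_sample)):
        like = prod(
            read_likelihood(r, g) for r, g in zip(reads_per_sample, combo)
        )
        weights[combo] = prior(combo) * like
    z = sum(weights.values())
    mono = sum(
        w for combo, w in weights.items()
        if len({allele for g in combo for allele in g}) == 1
    )
    return 1 - mono / z
```

Calling `polymorphism_probability` with one sample showing only A reads and another showing a balanced mix of A and B reads yields a value close to 1, while two samples with only A reads yield a value close to 0. Exhaustive enumeration grows exponentially with the number of samples, which is why a search strategy such as the gradient ascent mentioned above is needed in practice.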

This review was generated by AI and reviewed by a human editor.