QUICK REVIEW

[Paper Review] Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA

Tomáš Brůna, Lars Gabriel|arXiv (Cornell University)|Mar 28, 2024

Genomics and Phylogenetic Studies | Cited by 11

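One-Sentence Summary

A practical guide to running BRAKER, Galba, and TSEBRA for eukaryotic genome annotation, covering inputs, containerization, and workflows, with guidance specific to insect genomes.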

ABSTRACT

Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.

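Motivation and Objectives

Provide a practical guide to applying BRAKER and Galba for annotating protein-coding genes in eukaryotic genomes.
Explain how to prepare and select transcriptome and protein evidence to obtain accurate predictions.
Describe containerized deployment and HPC considerations for reproducible analyses.
Discuss the role of TSEBRA in combining predictions and improving gene sets.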

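Proposed Method

Describes the BRAKER and Galba pipelines and how they integrate evidence from RNA-Seq, proteins, and gene predictions.
Explains how TSEBRA merges AUGUSTUS-based and GeneMark-based predictions into an improved gene set.
Outlines containerized deployment with Docker and Singularity for reproducible workflows.
Provides input preparation workflows for genome masking, transcriptome data, and protein databases.
Provides step-by-step instructions and mock/demo datasets for practicing pipeline runs.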
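
The containerized runs described above can be sketched as a small helper that assembles command lines without executing them. This is a minimal illustration, not the chapter's own script: the function names, file names, and the image name braker3.sif are hypothetical, and the flags shown (--genome, --bam, --prot_seq, --threads for braker.pl; --genome, --prot_seq for galba.pl; singularity exec with -B) should be checked against the documentation of the installed versions before use.

```python
import shlex
from typing import List, Optional


def braker_command(genome: str,
                   bam: Optional[str] = None,
                   proteins: Optional[str] = None,
                   threads: int = 8) -> List[str]:
    """Assemble a braker.pl argument list for whatever evidence is available.

    Flag names are assumptions to verify against the installed BRAKER version.
    """
    if bam is None and proteins is None:
        raise ValueError("provide RNA-Seq alignments, proteins, or both")
    cmd = ["braker.pl", f"--genome={genome}", f"--threads={threads}"]
    if bam is not None:
        cmd.append(f"--bam={bam}")
    if proteins is not None:
        cmd.append(f"--prot_seq={proteins}")
    return cmd


def galba_command(genome: str, proteins: str) -> List[str]:
    """Assemble a galba.pl argument list (protein evidence only)."""
    return ["galba.pl", f"--genome={genome}", f"--prot_seq={proteins}"]


def in_container(image: str, cmd: List[str],
                 bind: Optional[str] = None) -> List[str]:
    """Prefix a command with `singularity exec`, optionally binding a path."""
    prefix = ["singularity", "exec"]
    if bind is not None:
        prefix += ["-B", bind]
    return prefix + [image] + cmd


if __name__ == "__main__":
    # Hypothetical file and image names, printed for inspection only.
    cmd = in_container(
        "braker3.sif",
        braker_command("genome.softmasked.fa",
                       bam="rnaseq.bam", proteins="proteins.fa"),
    )
    print(shlex.join(cmd))
```

Building the command as a list and printing it keeps the sketch side-effect free; the list could later be passed to subprocess.run once the flags have been confirmed for the local installation.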

Figure 1: Schematic view of the BRAKER [ 1 , 2 , 3 ] and Galba [ 4 ] pipelines. A: In BRAKER, GeneMark-ET, -EP, or -ETP [ 7 , 8 , 9 ] is trained (using extrinsic data upon availability) and used to predict an initial set of genes (genemark.gtf). This set of genes is filtered, and the resulting high-

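Experimental Results

Research Questions

RQ1: How do BRAKER and Galba use different sources of extrinsic evidence (transcriptome and protein data) to predict gene structures?
RQ2: What are the practical steps for setting up and running BRAKER, Galba, and TSEBRA in a containerized environment?
RQ3: How does TSEBRA affect the final gene set when predictions from BRAKER and Galba are combined?
RQ4: Which input data formats and preprocessing steps maximize accuracy and minimize the runtime of these pipelines?
RQ5: How do these pipelines perform on insect genomes and across genome sizes with varying evidence availability?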

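Key Findings

BRAKER3 achieves high accuracy by integrating RNA-Seq data with a large protein database, and uses TSEBRA to combine predictions.
Galba achieves higher accuracy on large genomes when proteins are the only evidence, using protein-to-genome spliced alignments (miniprot) to train AUGUSTUS.
TSEBRA acts as a combiner that improves the gene set by merging AUGUSTUS and GeneMark predictions.
Iso-Seq data can be incorporated into the BRAKER3 workflow through a modified GeneMark-ETP container.
Genome masking and careful handling of repetitive elements are essential for reliable gene prediction.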
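
As a quick sanity check for the masking point above, the following sketch reports, per sequence, the fraction of bases that are soft-masked, assuming the common convention that soft-masked repeats appear as lowercase letters in the FASTA file. The function name and the toy input are hypothetical; this is a simple diagnostic, not part of BRAKER or Galba.

```python
from typing import Dict, Iterable, List


def softmasked_fraction(fasta_lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of lowercase (soft-masked) bases per sequence."""
    counts: Dict[str, List[int]] = {}  # name -> [masked bases, total bases]
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            counts[name][0] += sum(1 for base in line if base.islower())
            counts[name][1] += len(line)
    return {
        seq: (masked / total if total else 0.0)
        for seq, (masked, total) in counts.items()
    }


if __name__ == "__main__":
    # Toy records; for a real genome, pass an open file handle instead.
    toy = [">chr1 example", "ACGTacgt", "NNNN", ">chr2", "acgt"]
    for seq, frac in softmasked_fraction(toy).items():
        print(f"{seq}\t{frac:.3f}")
```

A fraction of zero across all sequences would suggest the genome has not been soft-masked (or has been hard-masked with N), which is worth resolving before starting a run.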

Figure 3: Decision scheme for picking a suitable pipeline out of BRAKER3, BRAKER2, BRAKER1 (in combination with BRAKER2 and TSEBRA), and Galba.
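
The figure itself is not reproduced here, so the following is only a simplified sketch of such a decision, based on the abstract and on the usual division of labor between the pipelines (BRAKER1 for RNA-Seq only, BRAKER2 for proteins only, BRAKER3 for both). The function name and the large_genome flag are hypothetical; the abstract gives no size threshold, so that judgment is left to the caller, and the branch combining BRAKER1 and BRAKER2 output with TSEBRA is omitted.

```python
def suggest_pipeline(has_rnaseq: bool,
                     has_proteins: bool,
                     large_genome: bool = False) -> str:
    """Suggest a pipeline from evidence availability (simplified assumption)."""
    if has_rnaseq and has_proteins:
        # Both evidence types available: BRAKER3 integrates them.
        return "BRAKER3"
    if has_proteins:
        # Proteins only: the abstract reports Galba as more accurate
        # on large genomes in this setting.
        return "Galba" if large_genome else "BRAKER2"
    if has_rnaseq:
        # RNA-Seq only.
        return "BRAKER1"
    raise ValueError("at least one extrinsic evidence source is required")


if __name__ == "__main__":
    for rnaseq, proteins, large in [(True, True, False),
                                    (False, True, True),
                                    (False, True, False),
                                    (True, False, False)]:
        print(rnaseq, proteins, large, suggest_pipeline(rnaseq, proteins, large))
```

The original decision scheme in the chapter should take precedence wherever it differs from this simplification.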


This review was generated by AI and checked by human editors.