QUICK REVIEW

[Paper Review] Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA

Tomáš Brůna, Lars Gabriel|arXiv (Cornell University)|Mar 28, 2024

Genomics and Phylogenetic Studies|Cited by 11

One-line summary

A practical guide to running BRAKER, Galba, and TSEBRA for eukaryotic genome annotation. It covers inputs, containerization, and workflows, with guidance focused on insect genomes.

ABSTRACT

Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.

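Motivation and objectives

Provide practical guidelines for applying BRAKER and Galba to annotate protein-coding genes in eukaryotic genomes.
Explain how to prepare and select transcriptome data and protein evidence for accurate prediction.
Describe containerized deployment and HPC considerations that make analyses reproducible.
Discuss the role of TSEBRA in combining predictions and improving the gene set.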

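Proposed method

Describes the BRAKER and Galba pipelines and how they integrate evidence from RNA-Seq, proteins, and predictions.
Explains how TSEBRA combines predictions from AUGUSTUS and GeneMark-based outputs into an improved gene set.
Outlines containerized deployment with Docker and Singularity for reproducible workflows.
Provides input-preparation workflows for genome masking, transcriptome data, and protein databases.
Provides step-by-step instructions and toy/demonstration datasets for practicing pipeline runs.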
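As a rough illustration of what a containerized run might look like, the sketch below assembles a `braker.pl` command line for execution inside a Singularity image. It only builds and prints the command; it does not run anything. The image file name `braker3.sif` and all input file names are hypothetical, and the flags shown (`--genome`, `--prot_seq`, `--bam`, `--threads`) are assumptions that should be checked against `braker.pl --help` for the installed version. A Docker-based run would need a different prefix and volume mounts, which are not modeled here.

```python
import shlex


def build_braker_command(genome, proteins=None, bams=(), threads=8,
                         image="braker3.sif"):
    """Assemble (without running) a braker.pl call inside a Singularity image.

    `image` is a hypothetical local image file name; the flags should be
    checked against `braker.pl --help` for the installed version.
    """
    if not proteins and not bams:
        raise ValueError("provide protein sequences, RNA-Seq alignments, or both")

    cmd = ["singularity", "exec", image, "braker.pl",
           f"--genome={genome}", f"--threads={threads}"]
    if proteins:
        cmd.append(f"--prot_seq={proteins}")
    if bams:
        # Several alignment files are passed as one comma-separated list.
        cmd.append("--bam=" + ",".join(bams))
    return cmd


if __name__ == "__main__":
    cmd = build_braker_command("genome.fa", proteins="proteins.fa",
                               bams=["lib1.bam", "lib2.bam"])
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can later be passed to `subprocess.run` without shell quoting issues.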

Figure 1: Schematic view of the BRAKER [1, 2, 3] and Galba [4] pipelines. A: In BRAKER, GeneMark-ET, -EP, or -ETP [7, 8, 9] is trained (using extrinsic data upon availability) and used to predict an initial set of genes (genemark.gtf). This set of genes is filtered, and the resulting high-

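Experimental results

Research questions

RQ1: How do BRAKER and Galba use different sources of extrinsic evidence, such as transcriptome data and proteins, to predict gene structures?
RQ2: What are the practical steps for setting up and running BRAKER, Galba, and TSEBRA in a containerized environment?
RQ3: How does TSEBRA affect the final gene set when BRAKER and Galba predictions are combined?
RQ4: Which input data formats and preprocessing steps maximize accuracy and minimize runtime for these pipelines?
RQ5: How do these pipelines perform on insect genomes across a range of genome sizes and under differing evidence availability?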

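Key findings

BRAKER3 achieves high accuracy by integrating RNA-Seq with a large protein database, and uses TSEBRA to combine predictions.
Galba achieves high accuracy on large genomes, particularly when proteins are the only evidence, through protein-to-genome spliced alignment with miniprot followed by AUGUSTUS training.
TSEBRA acts as a combiner that merges AUGUSTUS and GeneMark predictions to improve the gene set.
Iso-Seq data can be incorporated into the BRAKER3 workflow by using a modified GeneMark-ETP container.
Genome masking and careful handling of repetitive elements are important for reliable gene prediction.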
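Because masking matters for prediction quality, a quick sanity check on the input genome can be useful before starting a long run. The sketch below assumes the common soft-masking convention, in which repeats are written in lowercase, and reports the fraction of A/C/G/T bases that are lowercase. The function name and the toy sequence are illustrative only; a fraction near zero would suggest the FASTA has not been soft-masked.

```python
import io


def softmasked_fraction(fasta_handle):
    """Return the fraction of A/C/G/T bases that are lowercase (soft-masked).

    Header lines and non-ACGT symbols such as N are ignored.
    """
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue
        seq = line.strip()
        lower = sum(seq.count(base) for base in "acgt")
        upper = sum(seq.count(base) for base in "ACGT")
        masked += lower
        total += lower + upper
    return masked / total if total else 0.0


if __name__ == "__main__":
    # Toy genome: 8 lowercase bases out of 20 A/C/G/T bases.
    toy = io.StringIO(">chr1\nACGTacgtNN\nacgtACGTACGT\n")
    print(f"soft-masked fraction: {softmasked_fraction(toy):.2f}")
```

For a real genome the same function can be called on an open file handle, for example `with open("genome.fa") as fh: softmasked_fraction(fh)`.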

Figure 3: Decision scheme for picking a suitable pipeline out of BRAKER3, BRAKER2, BRAKER1 (in combination with BRAKER2 and TSEBRA), and Galba.
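The figure itself is not reproduced here, but the choice it describes can be approximated from the abstract. The sketch below is a simplified reading, not the authors' exact decision scheme: it assumes BRAKER3 when both RNA-Seq and proteins are available, BRAKER1 for RNA-Seq only, and for protein-only runs BRAKER2 or, on large genomes, Galba. The `large_genome` flag is a hypothetical input because the source gives no size threshold, and the branch that combines BRAKER1 and BRAKER2 output with TSEBRA is not modeled.

```python
def choose_pipeline(has_rnaseq: bool, has_proteins: bool,
                    large_genome: bool = False) -> str:
    """Pick a pipeline name from the available evidence (simplified sketch)."""
    if has_rnaseq and has_proteins:
        # Both evidence types: the combined RNA-Seq + protein mode.
        return "BRAKER3"
    if has_proteins:
        # Protein-only: the abstract reports Galba as more accurate on large genomes.
        return "Galba" if large_genome else "BRAKER2"
    if has_rnaseq:
        # RNA-Seq only.
        return "BRAKER1"
    raise ValueError("at least one source of extrinsic evidence is required")


if __name__ == "__main__":
    print(choose_pipeline(has_rnaseq=True, has_proteins=True))
    print(choose_pipeline(has_rnaseq=False, has_proteins=True, large_genome=True))
```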


This review was generated by AI and checked by a human editor.