QUICK REVIEW

[Paper Review] Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA

Tomáš Brůna, Lars Gabriel | arXiv (Cornell University) | 2024-03-28

Genomics and Phylogenetic Studies | Citations: 11

One-Line Summary

A practical guide to running BRAKER, Galba, and TSEBRA for eukaryotic genome annotation, covering inputs, containerized deployment, and workflows, with guidance focused on insect genomes.

ABSTRACT

Annotating the structure of protein-coding genes represents a major challenge in the analysis of eukaryotic genomes. This task sets the groundwork for subsequent genomic studies aimed at understanding the functions of individual genes. BRAKER and Galba are two fully automated and containerized pipelines designed to perform accurate genome annotation. BRAKER integrates the GeneMark-ETP and AUGUSTUS gene finders, employing the TSEBRA combiner to attain high sensitivity and precision. BRAKER is adept at handling genomes of any size, provided that it has access to both transcript expression sequencing data and an extensive protein database from the target clade. In particular, BRAKER demonstrates high accuracy even with only one type of these extrinsic evidence sources, although it should be noted that accuracy diminishes for larger genomes under such conditions. In contrast, Galba adopts a distinct methodology utilizing the outcomes of direct protein-to-genome spliced alignments using miniprot to generate training genes and evidence for gene prediction in AUGUSTUS. Galba has superior accuracy in large genomes if protein sequences are the only source of evidence. This chapter provides practical guidelines for employing both pipelines in the annotation of eukaryotic genomes, with a focus on insect genomes.

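Motivation and Objectives

Provide practical guidelines for applying BRAKER and Galba to annotate protein-coding genes in eukaryotic genomes.
Explain how to prepare and select transcriptome and protein evidence for accurate predictions.
Describe containerized deployment and HPC considerations that make analyses reproducible.
Discuss the role of TSEBRA in combining predictions and improving gene sets.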

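Proposed Method

Describes the BRAKER and Galba pipelines and how they integrate evidence from RNA-Seq, proteins, and gene predictions.
Explains how TSEBRA combines predictions from AUGUSTUS and GeneMark-based outputs into an improved gene set.
Outlines containerized deployment with Docker and Singularity for reproducible workflows.
Provides input-preparation workflows for genome masking, transcriptome data, and protein databases.
Provides step-by-step instructions and toy/demonstration data sets for practicing pipeline runs.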
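As a rough illustration of what a pipeline run looks like, the sketch below only assembles argument lists for `braker.pl` and `galba.pl`; it does not execute anything. The flags (`--genome`, `--species`, `--bam`, `--prot_seq`, `--threads`) follow the tools' READMEs as far as I am aware and should be checked against the installed versions. All file names and the species label are hypothetical placeholders, and the helper functions are illustrative, not part of either tool.

```python
from typing import List, Optional


def braker_command(genome: str,
                   species: str,
                   bam: Optional[str] = None,
                   proteins: Optional[str] = None,
                   threads: int = 8) -> List[str]:
    """Assemble an argument list for braker.pl (nothing is executed here)."""
    if bam is None and proteins is None:
        # This sketch only covers runs with extrinsic evidence.
        raise ValueError("provide RNA-Seq alignments, proteins, or both")
    cmd = ["braker.pl",
           f"--genome={genome}",
           f"--species={species}",
           f"--threads={threads}"]
    if bam is not None:
        cmd.append(f"--bam={bam}")            # aligned RNA-Seq reads
    if proteins is not None:
        cmd.append(f"--prot_seq={proteins}")  # protein database in FASTA
    return cmd


def galba_command(genome: str,
                  species: str,
                  proteins: str,
                  threads: int = 8) -> List[str]:
    """Assemble an argument list for galba.pl, which needs protein evidence."""
    return ["galba.pl",
            f"--genome={genome}",
            f"--prot_seq={proteins}",
            f"--species={species}",
            f"--threads={threads}"]


if __name__ == "__main__":
    # Hypothetical inputs; print the commands rather than running them.
    print(" ".join(braker_command("genome.fa", "my_insect",
                                  bam="rnaseq.bam", proteins="proteins.fa")))
    print(" ".join(galba_command("genome.fa", "my_insect", "proteins.fa")))
```

Returning a list rather than a shell string keeps the arguments safe to pass to `subprocess.run` later, for example inside a Docker or Singularity wrapper.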

Figure 1: Schematic view of the BRAKER [1, 2, 3] and Galba [4] pipelines. A: In BRAKER, GeneMark-ET, -EP, or -ETP [7, 8, 9] is trained (using extrinsic data upon availability) and used to predict an initial set of genes (genemark.gtf). This set of genes is filtered, and the resulting high-

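Experimental Results

Research Questions

RQ1: How do BRAKER and Galba use different extrinsic evidence sources (transcriptome and protein) to predict gene structures?
RQ2: What are the practical steps for setting up and running BRAKER, Galba, and TSEBRA in a containerized environment?
RQ3: How does TSEBRA affect the final gene set when BRAKER and Galba predictions are combined?
RQ4: Which input data formats and preprocessing steps maximize accuracy and minimize runtime for these pipelines?
RQ5: How do these pipelines perform on insect genomes and across genome sizes under varying evidence availability?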

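Key Findings

BRAKER3 integrates RNA-Seq with a large protein database to deliver high accuracy and uses TSEBRA to combine predictions.
Galba uses miniprot protein-to-genome spliced alignments to train AUGUSTUS and delivers high accuracy in large genomes, particularly when protein sequences are the only evidence.
TSEBRA acts as a combiner that merges AUGUSTUS and GeneMark predictions to improve the gene set.
Iso-Seq data can be incorporated by using a modified GeneMark-ETP container for the BRAKER3 workflow.
Genome masking and careful handling of repeat elements are essential for reliable gene prediction.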
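Because masking matters for prediction quality, a quick sanity check before a run can be useful. The sketch below assumes the genome is soft-masked, meaning repeats are written in lowercase in the FASTA file; this summary does not specify the masking style, so treat that as an assumption. The function name and the example file name are hypothetical.

```python
from typing import Iterable


def softmasked_fraction(fasta_lines: Iterable[str]) -> float:
    """Return the fraction of sequence characters that are lowercase.

    Header lines (starting with '>') and blank lines are skipped.
    Returns 0.0 for input that contains no sequence characters.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory example: 8 of 16 bases are lowercase.
    demo = [">chr1", "ACGTacgt", "NNNNacgt"]
    print(softmasked_fraction(demo))
```

For a real file, the same function can be fed an open file handle, for example `softmasked_fraction(open("genome.fa"))`. A result of exactly zero suggests the genome has not been soft-masked.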

Figure 3: Decision scheme for picking a suitable pipeline out of BRAKER3, BRAKER2, BRAKER1 (in combination with BRAKER2 and TSEBRA), and Galba.
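The figure itself is not reproduced here, so the following is only a simplified sketch of such a decision, based on the abstract's statements: BRAKER uses both evidence types when available, and Galba is more accurate in large genomes when proteins are the only evidence. The mapping of RNA-Seq-only runs to BRAKER1 and protein-only runs to BRAKER2 reflects the usual naming of those modes. The `large_genome_bp` cutoff is a hypothetical placeholder, not a value from the paper, and the sketch omits the "BRAKER1 in combination with BRAKER2 and TSEBRA" branch named in the caption.

```python
def pick_pipeline(has_rnaseq: bool,
                  has_proteins: bool,
                  genome_size_bp: int,
                  large_genome_bp: int = 1_000_000_000) -> str:
    """Suggest a pipeline from the available evidence and genome size.

    large_genome_bp is an illustrative threshold (assumption), not a
    cutoff taken from the paper.
    """
    if has_rnaseq and has_proteins:
        return "BRAKER3"
    if has_rnaseq:
        return "BRAKER1"
    if has_proteins:
        # Proteins only: the abstract reports Galba as more accurate
        # in large genomes.
        return "Galba" if genome_size_bp >= large_genome_bp else "BRAKER2"
    raise ValueError("this sketch only covers runs with extrinsic evidence")


if __name__ == "__main__":
    print(pick_pipeline(True, True, 300_000_000))
    print(pick_pipeline(False, True, 2_500_000_000))
```

Keeping the size cutoff as a parameter makes the assumption explicit and easy to replace with the criteria from the actual decision scheme.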

This review was generated by AI and reviewed by a human editor.