QUICK REVIEW

[Paper Review] METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Ollie Liu, Sami Jaghouar|arXiv (Cornell University)|Jan 3, 2025

Genetics, Bioinformatics, and Biomedical Research|Cited by 3

One-Sentence Summary

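METAGENE-1 is a 7B-parameter decoder-only model pretrained on more than 1.5 trillion base pairs of metagenomic wastewater sequences, targeting pathogen detection, metagenomic embedding, and anomaly detection for pandemic monitoring. It achieves state-of-the-art results on genomic benchmarks and downstream public health tasks.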

ABSTRACT

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.

Motivation and Objectives

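Motivation: train a metagenomic foundation model on diverse wastewater sequencing data so that it captures the broad distribution of microbial communities present in wastewater.
Describe the dataset creation, the tokenization, and the decoder-only Transformer architecture tailored to metagenomic data.
Evaluate METAGENE-1 on pathogen detection, genomic sequence embedding, and genomic benchmarks.
Demonstrate downstream applications, including anomaly detection on wastewater data and potential public health uses.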

Proposed Method

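Pretrain a 7-billion-parameter autoregressive (decoder-only) Transformer on a corpus of metagenomic DNA/RNA sequences totaling over 1.5 trillion base pairs.
Tokenize sequences with byte-pair encoding (BPE) using a vocabulary of 1,024 tokens, yielding roughly 370 billion tokens.
Train with a context length of 512 tokens, packing multiple reads into each context and applying an attention mask that blocks attention across reads.
Use a dense Transformer with 32 layers, 32 attention heads, an embedding dimension of 4096, and RMSNorm, trained with an Adam-style optimizer and a cosine learning-rate schedule.
Continue pretraining while mixing in genome data from known species at a 1:8 ratio to broaden generalization.
Evaluate on a pathogen detection benchmark scored by Matthews correlation coefficient (MCC), genomic embedding tasks (Gene-MTEB), and subtasks of the Genome Understanding Evaluation (GUE); also assess embedding quality and anomaly detection capability.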
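The packed-read masking step above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `packed_causal_mask`, the use of NumPy, and the convention that `True` means "may attend" are all assumptions made for illustration. The idea is that each token gets the index of the read it belongs to, and a position may attend only to earlier positions within the same read.

```python
import numpy as np


def packed_causal_mask(read_lengths, context_len=512):
    """Build a boolean attention mask for reads packed into one context.

    mask[i, j] is True when position i may attend to position j.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    total = sum(read_lengths)
    if total > context_len:
        raise ValueError("packed reads exceed the context length")

    # Label every token with the index of its read; padding keeps the label -1.
    segment = np.full(context_len, -1, dtype=np.int64)
    pos = 0
    for read_idx, length in enumerate(read_lengths):
        segment[pos:pos + length] = read_idx
        pos += length

    # Allow attention only inside the same read, and never from padding rows.
    same_read = (segment[:, None] == segment[None, :]) & (segment[:, None] >= 0)
    # Standard causal constraint: a position sees itself and earlier positions.
    causal = np.tril(np.ones((context_len, context_len), dtype=bool))
    return same_read & causal


if __name__ == "__main__":
    # Two reads of 3 and 2 tokens packed into a toy context of 6 positions.
    print(packed_causal_mask([3, 2], context_len=6).astype(int))
```

The result is block-diagonal and lower-triangular within each block. Rows that correspond to padding are fully masked, so a real training loop would need to exclude those positions from the softmax and the loss.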

Figure 2: Overview of the metagenomic data collection and sequencing pipeline for model pretraining. The process begins with the collection of wastewater (left), which contains genomic fragments from a diverse collection (e.g., tens of thousands) of constituent organisms (center). These samples

Experimental Results

Research Questions

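RQ1: Can METAGENE-1 reliably detect human pathogens across different sequencing deliveries?
RQ2: Do the metagenomic representations learned from wastewater generalize to embedding and cross-species classification tasks?
RQ3: How does METAGENE-1 perform on standard genomic benchmarks compared with earlier multi-species models?
RQ4: Can METAGENE-1 support anomaly detection and early threat detection in wastewater-based surveillance?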

Key Findings

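On the pathogen detection benchmark, METAGENE-1 consistently outperforms competing models across all four datasets, with MCC gains of 3 to 17 points.
On the genomic embedding tasks, METAGENE-1 achieves the highest global average score and performs strongly on Human-Virus and related subtasks.
On the Genome Understanding Evaluation (GUE), METAGENE-1 ranks first on 13 of 28 subtasks and is especially strong on epigenetic mark prediction (EMP), though promoter-related tasks leave room for improvement.
Gene-MTEB embedding results show robust zero-shot representations, particularly on the human-virus tasks, where accuracy on several metrics exceeds the baseline models by more than 6 points.
An anomaly detection experiment based on length-normalized cross-entropy loss shows clear separation between metagenomic and non-metagenomic data, indicating potential for out-of-distribution (OOD) detection in wastewater.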
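Two quantities named in the findings above, the Matthews correlation coefficient and a length-normalized cross-entropy score, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's evaluation code: the function names are hypothetical, normalization is assumed to be per token, and the decision threshold is an arbitrary placeholder that a real system would calibrate on held-out data. The per-token log-probabilities are assumed to come from the language model.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def length_normalized_nll(token_log_probs):
    """Mean negative log-likelihood per token (assumed normalization)."""
    if not token_log_probs:
        raise ValueError("sequence must contain at least one token")
    return -sum(token_log_probs) / len(token_log_probs)


def flag_anomalies(reads_log_probs, threshold):
    """Mark a read as out-of-distribution when its score exceeds the threshold.

    `threshold` is a hypothetical value chosen for illustration.
    """
    return [length_normalized_nll(lp) > threshold for lp in reads_log_probs]


if __name__ == "__main__":
    in_dist = [math.log(0.9)] * 3   # model assigns high probability per token
    out_dist = [math.log(0.1)] * 3  # model assigns low probability per token
    print(flag_anomalies([in_dist, out_dist], threshold=1.0))
    print(mcc(tp=40, tn=45, fp=5, fn=10))
```

Dividing by sequence length keeps scores comparable between short and long reads, which is what lets a single threshold separate the two groups. Benchmark tables often report MCC multiplied by 100, which is how a gain can be stated in "points".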

Figure 3: Metagenomic composition of the METAGENE-1 pretraining dataset, estimated via Kraken 2 (Wood et al., 2019) sequence classification, and visualized via Krona (Ondov et al., 2011). See Figure 7 for a more detailed view.


This review was generated by AI and reviewed by a human editor.