QUICK REVIEW

[Paper Review] Identifying viruses from metagenomic data by deep learning

Jie Ren, Kai Song|arXiv (Cornell University)|Jun 20, 2018

Bacteriophages and microbial interactions | References: 30 | Cited by: 26

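One-sentence summary

DeepVirFinder is a reference-free, alignment-free deep learning method that trains a convolutional neural network on k-mer frequency patterns to identify viral sequences in metagenomic data. It outperformed VirFinder at all contig lengths. Applied to gut metagenomes from patients with colorectal carcinoma (CRC), it identified 51,138 viral sequences belonging to 175 bins, 10 of which were associated with cancer status, suggesting potential use in non-invasive diagnosis.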

ABSTRACT

The recent development of metagenomic sequencing makes it possible to sequence microbial genomes including viruses in an environmental sample. Identifying viral sequences from metagenomic data is critical for downstream virus analyses. The existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences. Here we have developed a reference-free and alignment-free machine learning method, DeepVirFinder, for predicting viral sequences in metagenomic data using deep learning techniques. DeepVirFinder was trained based on a large number of viral sequences discovered before May 2015. Evaluated on the sequences after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths. Enlarging the training data by adding millions of purified viral sequences from environmental metavirome samples significantly improves the accuracy for predicting under-represented viruses. Applying DeepVirFinder to real human gut metagenomic samples from patients with colorectal carcinoma (CRC) identified 51,138 viral sequences belonging to 175 bins. Ten bins were associated with the cancer status, indicating their potential use for non-invasive diagnosis of CRC. In summary, DeepVirFinder greatly improved the precision and recall rates of viral identification, and it will significantly accelerate the discovery rate of viruses.

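Research motivation and goals

Develop a reference-free, alignment-free method for identifying viral sequences in metagenomic data.
Improve detection of unknown and short viral contigs that conventional homology-based methods miss.
Improve the accuracy of viral identification by applying deep learning to large-scale viral sequence data.
Link viral sequences to disease status to explore non-invasive diagnosis of colorectal carcinoma (CRC).
Extend detection to under-represented viral groups by including environmental metavirome data in training.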

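Proposed method

Train a convolutional neural network (CNN) on k-mer frequency patterns from viral and non-viral sequences to classify metagenomic contigs as viral or non-viral.
Use a large training set that combines viral sequences from RefSeq with millions of purified viral contigs from environmental metavirome datasets (e.g., IBD, SAM, TOV, healthy gut).
Augment the training data with metavirome-derived viral sequences to improve detection of under-represented viral families.
Cluster the predicted viral contigs into 175 bins with COCACOLA, based on sequence similarity and abundance.
Map reads to the viral bins with bowtie2 and compute RPKM to quantify abundance.
Fit an L1-penalized logistic regression model, with RPKM values as predictors, to identify viral bins significantly associated with CRC status.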
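
The first step above refers to k-mer frequency patterns as the model input. As a rough illustration of what such a feature vector looks like, the sketch below counts overlapping k-mers in a contig and normalizes them to frequencies. The function name `kmer_frequencies`, the choice of k, and the rule of skipping windows with ambiguous bases are assumptions made for this example; this review does not specify DeepVirFinder's exact input encoding, so this should not be read as the authors' implementation.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Return normalized k-mer frequencies for a DNA sequence.

    Hypothetical helper for illustration. Windows containing any
    character other than A, C, G, T (for example N) are skipped.
    """
    alphabet = "ACGT"
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    # Enumerate all 4**k k-mers in a fixed order so every contig
    # maps to a vector of the same length.
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy contig (made up for this example, not data from the paper).
freqs = kmer_frequencies("ACGTACGTNACG", k=2)
print(len(freqs))   # 16 possible 2-mers
print(freqs["AC"])  # share of valid windows that read "AC"
```

Because the output always has 4**k entries, contigs of different lengths map to vectors of the same size, which is what makes this kind of feature convenient for a classifier.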

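Experimental results

Research questions

RQ1: Can a deep learning model outperform existing reference-based and homology-based methods at identifying viral sequences in metagenomic data?
RQ2: To what extent does including environmental metavirome sequences in training improve detection of under-represented viral families?
RQ3: Can the viral contigs identified by DeepVirFinder be clustered into biologically meaningful bins that relate to disease status?
RQ4: Are specific viral bins significantly associated with colorectal carcinoma (CRC) status in human gut metagenomes?
RQ5: Can DeepVirFinder support non-invasive diagnosis of CRC through the detection of viral signatures?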

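Key findings

On sequences discovered after May 2015, DeepVirFinder outperformed VirFinder at all contig lengths, with higher precision and recall.
Including environmental metavirome sequences in training significantly improved prediction accuracy for under-represented viral families.
In human gut metagenomes from CRC patients, DeepVirFinder identified 51,138 viral sequences belonging to 175 contig bins.
Ten viral bins (including B19, B60, B61, B218, and B227) were significantly associated with CRC status, with regression coefficients ranging from -0.3475 to 0.1764.
Cluster analysis showed that, across the 175 viral bins, 31.1% to 96.15% of contigs contained proteins; the main matches were phage-related proteins and unclassified phages.
A logistic regression model with L1 regularization, built on the RPKM values of the viral bins, showed significant classification performance in predicting CRC status.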
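
To make the RPKM and L1-penalized regression steps more concrete, here is a minimal sketch in plain Python. It uses the standard RPKM definition (reads per kilobase per million mapped reads) and fits the logistic model by proximal gradient descent with soft-thresholding, which is a solver chosen for this example and not necessarily the one the authors used. All function names, the hyperparameters `lam`, `lr`, and `n_iter`, and the toy data are hypothetical. In practice the feature columns would be per-bin RPKM values, typically standardized.

```python
import math
from typing import List, Sequence, Tuple


def rpkm(mapped_reads: int, bin_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of sequence per million mapped reads."""
    return mapped_reads * 1e9 / (bin_length_bp * total_mapped_reads)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 penalty: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 0.05,
    lr: float = 0.1,
    n_iter: int = 2000,
) -> Tuple[List[float], float]:
    """Fit logistic regression with an L1 penalty on the weights.

    The intercept is left unpenalized. Returns (weights, intercept).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Made-up toy data, not from the paper: feature 0 tracks the label,
# feature 1 is unrelated to it.
X = [[0, 1], [0, -1], [0, 1], [0, -1], [1, 1], [1, -1], [1, 1], [1, -1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights, intercept = fit_l1_logistic(X, y)
print(weights, intercept)
print(rpkm(500, 20_000, 10_000_000))  # 2.5
```

The L1 penalty drives the weights of uninformative features to exactly zero, which is why this kind of model can single out a small number of bins, such as the ten reported above, from a larger set of candidates.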

This review was generated by AI and reviewed by a human editor.