[Paper Walkthrough] Convolutional neural network models for cancer type prediction based on gene expression
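This study proposes three models, 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN, that predict cancer type from unstructured gene expression data in The Cancer Genome Atlas (TCGA). The models reach 93.9% to 95.0% accuracy across 34 classes (33 cancer types plus normal). Guided saliency analysis identifies 2,090 cancer markers, including well-known ones such as GATA3 and ESR1, and an extension to breast cancer subtype prediction reaches an average accuracy of 88.42%.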
Background: Precise prediction of cancer types is vital for cancer diagnosis and therapy. Important cancer marker genes can be inferred through predictive models. Several studies have attempted to build machine learning models for this task; however, none has taken into consideration the effects of tissue of origin, which can potentially bias the identification of cancer markers.
Results: In this paper, we introduced several Convolutional Neural Network (CNN) models that take unstructured gene expression inputs to classify tumor and non-tumor samples into their designated cancer types or as normal. Based on different designs of gene embeddings and convolution schemes, we implemented three CNN models: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN. The models were trained and tested on a combined 10,340 samples of 33 cancer types and 731 matched normal tissues from The Cancer Genome Atlas (TCGA). Our models achieved excellent prediction accuracies (93.9-95.0%) among 34 classes (33 cancers and normal). Furthermore, we interpreted one of the models, the 1D-CNN model, with a guided saliency technique and identified a total of 2,090 cancer markers (108 per class). The concordance of differential expression of these markers between the cancer type they represent and others is confirmed. In breast cancer, for instance, our model identified well-known markers such as GATA3 and ESR1. Finally, we extended the 1D-CNN model to predict breast cancer subtypes and achieved an average accuracy of 88.42% among 5 subtypes. The code can be found at https://github.com/chenlabgccri/CancerTypePrediction.
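Research motivation and objectives
- Develop deep learning models that predict cancer type from unstructured gene expression data without prior feature engineering.
- Address the tissue-of-origin bias in cancer marker identification, which earlier machine learning studies did not consider, by having the models classify samples into cancer types or as normal directly from gene expression patterns.
- Identify biologically relevant cancer marker genes with interpretability techniques such as guided saliency.
- Extend the model to predict breast cancer molecular subtypes with high accuracy.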
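Proposed method
- Three convolutional neural network architectures are used: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN, each designed to take raw gene expression vectors as input.
- Gene embeddings represent each gene expression profile as an input tensor, so that the model can learn hierarchical patterns.
- One-dimensional and two-dimensional convolutions capture local patterns in the embedded gene expression input.
- Guided backpropagation saliency maps are used to interpret model decisions and to identify the genes most important for each cancer type.
- Models are trained and evaluated on a combined TCGA dataset of 10,340 samples from 33 cancer types and 731 matched normal tissue samples.
- The 1D-CNN model is extended to breast cancer subtyping, with a multi-class output head that predicts five molecular subtypes.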
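The 1D-CNN idea in the list above can be illustrated with a toy forward pass: a strided 1D convolution slides over the expression vector, followed by ReLU, max pooling, and a softmax over classes. This is a minimal pure-Python sketch. The kernel size, stride, pool size, random weights, toy dimensions, and the function names `conv1d`, `max_pool`, and `predict` are all illustrative assumptions, not the authors' settings or code.

```python
import math
import random


def conv1d(x, kernel, stride):
    """Valid 1D convolution (cross-correlation) of vector x with one kernel."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(x) - k + 1, stride)
    ]


def relu(v):
    return [max(0.0, a) for a in v]


def max_pool(v, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, size)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]


def predict(x, kernels, dense, stride=2, pool=2):
    """Forward pass: conv -> ReLU -> max pool -> flatten -> dense -> softmax."""
    features = []
    for kernel in kernels:
        features.extend(max_pool(relu(conv1d(x, kernel, stride)), pool))
    logits = [sum(w * f for w, f in zip(row, features)) for row in dense]
    return softmax(logits)


random.seed(0)
n_genes, n_classes = 20, 3  # toy sizes (assumption); the paper uses 34 classes
x = [random.random() for _ in range(n_genes)]  # stand-in for expression values
kernels = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
# Each kernel yields (20 - 4) // 2 + 1 = 9 conv outputs, pooled down to 4,
# so two kernels give 8 flattened features for the dense layer.
dense = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(n_classes)]
probs = predict(x, kernels, dense)
print(probs)  # one probability per class
```

In a real model the weights are learned by training rather than drawn at random, and a framework layer would replace the hand-written loops. The sketch only shows how an unstructured expression vector flows through a 1D convolutional classifier.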
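Experimental results
Research questions
- RQ1: Can convolutional neural networks classify cancer types effectively from unstructured gene expression data, without prior feature selection?
- RQ2: How do different CNN architectures (1D versus 2D) and embedding strategies affect multi-class cancer prediction performance?
- RQ3: Can interpretability techniques such as guided saliency identify both known and novel cancer marker genes that are biologically relevant?
- RQ4: How well does the 1D-CNN model generalize to finer-grained classification tasks such as breast cancer subtyping?
- RQ5: Are the identified marker genes consistent with known differential expression patterns across cancer types?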
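Key findings
- The 1D-CNN model achieved the highest test accuracy, 95.0%, when classifying gene expression data into 34 classes (33 cancer types plus normal).
- The 2D-Hybrid-CNN model also performed well, reaching 94.5% accuracy, which suggests a benefit from combining 2D convolutional layers with global pooling.
- Guided saliency analysis identified 2,090 cancer markers (108 per class), and the differential expression of these markers between the cancer type they represent and the others was confirmed to be concordant.
- In breast cancer, the model identified well-established markers such as GATA3 and ESR1, which supports the biological relevance of the selected genes.
- The extended 1D-CNN model reached an average accuracy of 88.42% when predicting five breast cancer molecular subtypes.
- By learning cancer-specific patterns directly from gene expression profiles, the models appeared robust to tissue-of-origin bias.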
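The marker-selection step behind these findings can be sketched with guided backpropagation on a tiny network. At each ReLU, the gradient is passed back only where the unit was active in the forward pass and the incoming gradient is positive. Genes are then ranked by their mean saliency over the samples of a class. This is a hedged pure-Python sketch, not the authors' procedure: the one-hidden-layer network, the hand-picked weights, the placeholder names `GENE_A` and `GENE_B`, and the functions `guided_saliency` and `top_markers` are assumptions made for illustration.

```python
def guided_saliency(x, W1, w_out):
    """Guided backprop for score = w_out . relu(W1 x), one value per input gene."""
    hidden_pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    grads_hidden = []
    for pre, g in zip(hidden_pre, w_out):
        # Guided rule at the ReLU: keep the gradient only if the unit was
        # active in the forward pass AND the incoming gradient is positive.
        grads_hidden.append(g if (pre > 0 and g > 0) else 0.0)
    return [
        sum(grads_hidden[h] * W1[h][i] for h in range(len(W1)))
        for i in range(len(x))
    ]


def top_markers(samples, W1, w_out, gene_names, k):
    """Average guided saliency over one class's samples; return the top k genes."""
    total = [0.0] * len(gene_names)
    for x in samples:
        for i, s in enumerate(guided_saliency(x, W1, w_out)):
            total[i] += s
    mean = [t / len(samples) for t in total]
    ranked = sorted(range(len(gene_names)), key=lambda i: mean[i], reverse=True)
    return [gene_names[i] for i in ranked[:k]]


# Toy setup: weights are invented so that the first two genes drive the score.
gene_names = ["GATA3", "ESR1", "GENE_A", "GENE_B"]
W1 = [[1.0, 0.5, 0.0, 0.0],   # hidden unit 0 reads the first two genes
      [0.0, 0.0, 1.0, -1.0]]  # hidden unit 1 reads the last two genes
w_out = [2.0, -1.0]           # class score weights on the hidden units
samples = [[1.0, 1.0, 0.5, 0.2], [0.8, 1.2, 0.1, 0.0]]

print(top_markers(samples, W1, w_out, gene_names, k=2))  # ['GATA3', 'ESR1']
```

The ranking here follows from the invented weights by construction and says nothing about real biology. With a trained network, the same ranking would be computed per class from that network's gradients.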
This walkthrough was generated by AI and reviewed by a human editor.