QUICK REVIEW

[Paper Review] Deep Motif Dashboard: Visualizing and Understanding Genomic Sequences Using Deep Neural Networks

Jack Lanchantin, Ritambhara Singh|arXiv (Cornell University)|Aug 12, 2016

Genomics and Chromatin Dynamics|References: 15|Cited by: 25

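ONE-SENTENCE SUMMARY

This paper introduces the Deep Motif Dashboard (DeMo Dashboard), a visualization toolkit for interpreting deep neural network (DNN) models on the transcription factor binding site (TFBS) classification task; it combines saliency maps, temporal output scores, and class-specific optimization. The CNN-RNN architecture outperforms the other models, and the visualizations indicate that it captures both motifs and long-range dependencies among them, offering insight into why transcription factors bind to particular genomic sequences.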

ABSTRACT

Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding (TFBS) site classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard) which provides a suite of visualization strategies to extract motifs, or sequence patterns from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method is finding a test sequence's saliency map which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs as well as dependencies among them.
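
To make the saliency-map idea from the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It swaps the trained DNN for a hypothetical one-filter scanner with global max pooling (the `FILTER` weights and the motif TGA are invented for illustration), so the first-order derivative of the score with respect to the one-hot input can be written by hand rather than obtained by backpropagation. Per-nucleotide importance is taken as the absolute gradient at the observed base, which is an assumption about the exact formula.

```python
BASES = "ACGT"

# Hypothetical filter weights (rows: motif positions, columns: A, C, G, T).
# Chosen by hand so that the filter responds most strongly to "TGA".
FILTER = [
    [-1.0, -1.0, -1.0, 2.0],   # prefers T
    [-1.0, -1.0, 2.0, -1.0],   # prefers G
    [2.0, -1.0, -1.0, -1.0],   # prefers A
]


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def window_scores(x):
    """Filter response at every valid start position (a 1-D convolution)."""
    k = len(FILTER)
    return [
        sum(FILTER[j][c] * x[i + j][c] for j in range(k) for c in range(4))
        for i in range(len(x) - k + 1)
    ]


def score(x):
    """Toy class score: global max pooling over the filter responses."""
    return max(window_scores(x))


def gradient(x):
    """Analytic d(score)/d(x): the filter weights placed at the best window."""
    scores = window_scores(x)
    best = scores.index(max(scores))
    grad = [[0.0] * 4 for _ in x]
    for j, row in enumerate(FILTER):
        grad[best + j] = list(row)
    return grad


def saliency(seq):
    """Absolute gradient at the observed base, one value per nucleotide."""
    x = one_hot(seq)
    g = gradient(x)
    return [abs(sum(g[i][c] * x[i][c] for c in range(4))) for i in range(len(x))]


print(saliency("CCTGACC"))
```

On `CCTGACC` the sketch assigns nonzero saliency only to the three positions covered by TGA. With a real network the gradient would come from automatic differentiation and would generally be spread over many positions.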

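MOTIVATION AND OBJECTIVES

To address the interpretability challenge of deep neural networks (DNNs) in genomics, particularly on the TFBS classification task.
To develop a visualization toolkit that helps researchers understand why a DNN makes a particular prediction about transcription factor binding.
To compare three DNN architectures (CNN, RNN, and CNN-RNN) on the TFBS classification task in terms of performance and interpretability.
To use a motif-matching tool to assess how well each model's internal representations align with known biological motifs.
To show that visualizing DNNs can not only recover known motifs but also reveal long-range dependencies between motifs that traditional motif-finding tools may miss.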

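PROPOSED METHODS

Saliency maps are computed from first-order derivatives to highlight the nucleotides that most influence the model's prediction.
Temporal output scores track how the model's prediction score changes as it reads the input sequence, revealing key positions in the sequence.
Class-specific visualization uses stochastic gradient optimization to generate the input sequence that best represents the positive TFBS class.
The toolkit is applied to three architectures: a convolutional neural network (CNN), a recurrent neural network (RNN), and a hybrid CNN-RNN model.
Tomtom is used for motif matching, comparing the motifs produced by the visualizations against known JASPAR motifs.
Performance is evaluated by AUC score and by motif-matching accuracy on 57 TF datasets.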
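
The class-specific optimization step can be illustrated with a small sketch, under heavy simplifying assumptions. The trained network is replaced by a hypothetical linear scorer over a length-5 input (the `WEIGHTS` table is invented), the objective is the class score minus an L2 penalty on the input, and plain gradient ascent from an all-zero input stands in for stochastic gradient optimization. The step size, penalty strength `lam`, and iteration count are illustrative choices, not values from the paper.

```python
BASES = "ACGT"

# Hypothetical weights of a toy linear scorer (rows: positions, columns: A, C, G, T).
WEIGHTS = [
    [0.1, 0.3, 0.1, 0.1],   # weak preference for C
    [0.0, 0.0, 0.0, 2.0],   # strong preference for T
    [0.0, 0.0, 2.0, 0.0],   # strong preference for G
    [2.0, 0.0, 0.0, 0.0],   # strong preference for A
    [0.1, 0.1, 0.3, 0.1],   # weak preference for G
]


def class_score(x):
    """Toy positive-class score: a weighted sum over the input matrix."""
    return sum(WEIGHTS[i][c] * x[i][c] for i in range(len(x)) for c in range(4))


def objective(x, lam):
    """Class score minus an L2 penalty that keeps the input bounded."""
    penalty = sum(v * v for row in x for v in row)
    return class_score(x) - lam * penalty


def objective_gradient(x, lam):
    """Analytic gradient of the objective with respect to the input."""
    return [[WEIGHTS[i][c] - 2.0 * lam * x[i][c] for c in range(4)]
            for i in range(len(x))]


def optimize_input(steps=200, lr=0.1, lam=0.5):
    """Gradient ascent on the input itself, starting from all zeros."""
    x = [[0.0] * 4 for _ in WEIGHTS]
    for _ in range(steps):
        g = objective_gradient(x, lam)
        x = [[x[i][c] + lr * g[i][c] for c in range(4)] for i in range(len(x))]
    return x


def decode(x):
    """Read the optimized matrix as a sequence: highest-valued base per position."""
    return "".join(BASES[max(range(4), key=lambda c: row[c])] for row in x)


best_input = optimize_input()
print(decode(best_input))
```

With these toy weights the optimized input decodes to `CTGAG`, with the TGA core carrying most of the weight. For a real DNN the gradient would be obtained by backpropagation through the network, and the optimized matrix is typically displayed as a sequence logo rather than collapsed to a single string.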

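EXPERIMENTAL RESULTS

Research Questions

RQ1: Which DNN architecture (CNN, RNN, or CNN-RNN) performs best on the TFBS classification task, and why?
RQ2: To what extent do saliency maps and temporal output scores reveal a DNN's decision process when classifying genomic sequences?
RQ3: To what extent can class-specific optimization generate biologically plausible motifs consistent with known transcription factor binding patterns?
RQ4: Can the visualization techniques reveal long-range dependencies between motifs that traditional motif-finding tools overlook?
RQ5: How well do the motifs produced by DNN visualization align with known motifs in the JASPAR database?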

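Key Findings

The CNN-RNN architecture achieved the highest AUC score of the three models, outperforming both the CNN and the RNN on the TFBS classification task.
Saliency maps show that on a challenging sequence (NFYB) the CNN-RNN focuses on two distinct regions, which explains why it classifies the sequence correctly where the CNN and RNN fail.
Temporal output scores show the model switching from a negative to a positive prediction near the positions of known JASPAR motifs, indicating that these positions are key binding sites.
Class optimization produces sequences with patterns resembling known motifs; the CNN yields the clearest motif patterns, while the CNN-RNN captures more complex dependencies.
Motif matching with Tomtom shows that the CNN extracts motifs most accurately (19 matches out of 57 TFs), followed by the CNN-RNN (13 matches) and the RNN (11 matches).
The results suggest that although the CNN excels at motif detection, the CNN-RNN's advantage lies in modeling dependencies between motifs, which contributes to its better overall performance.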
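
The negative-to-positive switch reported for temporal output scores can be reproduced in miniature. The sketch below is an illustration only: a hand-built recurrent update (a hypothetical stand-in for a trained RNN, hard-wired to recognize the invented motif TGA) carries a state along the sequence, and a sigmoid readout is applied after every nucleotide. The readout constants are arbitrary choices made so the score sits below 0.5 until the motif has been read.

```python
import math

MOTIF = "TGA"  # hypothetical motif; a trained RNN would learn its own patterns


def step(state, base):
    """One recurrent update. State is (match progress, accumulated evidence)."""
    progress, evidence = state
    if base == MOTIF[progress]:
        progress += 1
    elif base == MOTIF[0]:
        progress = 1
    else:
        progress = 0
    if progress == len(MOTIF):
        # A full motif occurrence has just been read.
        evidence += 1.0
        progress = 0
    return progress, evidence


def readout(state):
    """Sigmoid output score computed from the current state."""
    _, evidence = state
    return 1.0 / (1.0 + math.exp(-(4.0 * evidence - 2.0)))


def temporal_output_scores(seq):
    """Prediction score after each nucleotide, reading left to right."""
    state = (0, 0.0)
    scores = []
    for base in seq:
        state = step(state, base)
        scores.append(readout(state))
    return scores


def first_positive_position(scores, threshold=0.5):
    """Index at which the score first exceeds the threshold, or None."""
    for t, s in enumerate(scores):
        if s > threshold:
            return t
    return None


scores = temporal_output_scores("CCTGACC")
print(first_positive_position(scores))
```

For `CCTGACC` the score stays below 0.5 through index 3 and crosses it at index 4, the last base of TGA. Plotting the per-step scores of a real recurrent model in the same way gives the kind of trace the finding above refers to.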

This review was generated by AI and reviewed by a human editor.