QUICK REVIEW

[Paper Review] Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders

Mahdi Ghorbani, Samarjeet Prasad|arXiv (Cornell University)|Aug 27, 2021

Protein Structure and Dynamics|References 66|Cited by 29

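ONE-SENTENCE SUMMARY

This paper proposes a Gaussian mixture variational autoencoder (GMVAE) that uses the Gumbel-Softmax reparameterization to jointly perform dimensionality reduction and clustering of protein folding trajectories while remaining end-to-end differentiable. The method learns a funnel-shaped free energy landscape with clearly separated metastable states, and Markov models built on its latent space reproduce folding timescales comparable to those of TICA-based Markov models.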

ABSTRACT

Conformational sampling of biomolecules using molecular dynamics simulations often produces large amounts of high-dimensional data that are difficult to interpret using conventional analysis techniques. Dimensionality reduction methods are thus required to extract useful and relevant information. Here we devise a machine learning method, the Gaussian mixture variational autoencoder (GMVAE), that can simultaneously perform dimensionality reduction and clustering of biomolecular conformations in an unsupervised way. We show that GMVAE can learn a reduced representation of the free energy landscape of protein folding with highly separated clusters that correspond to the metastable states during folding. Since GMVAE uses a mixture of Gaussians as the prior, it can directly account for the multi-basin nature of the protein folding free-energy landscape. To make the model end-to-end differentiable, we use a Gumbel-softmax distribution. We test the model on three long-timescale protein folding trajectories and show that the GMVAE embedding resembles the folding funnel, with folded states at the bottom of the funnel and unfolded states toward its outer regions. Additionally, we show that the latent space of GMVAE can be used for kinetic analysis, and that Markov state models built on this embedding produce folding and unfolding timescales in close agreement with other rigorous dynamical embeddings such as time-lagged independent component analysis (TICA).
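To illustrate the Gumbel-softmax step mentioned in the abstract, the following is a minimal sketch of how a relaxed one-hot cluster assignment can be sampled from unnormalized cluster logits. It is not the authors' implementation: the function name `gumbel_softmax_sample`, the example logits, and the temperature value are all assumptions chosen for illustration.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=0.5, rng=random):
    """Draw one relaxed one-hot sample from unnormalized cluster logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u away from 0 and 1
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    shift = max(noisy)
    exps = [math.exp(v - shift) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three clusters (illustrative values, not from the paper).
rng = random.Random(0)
soft_assignment = gumbel_softmax_sample([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
print(soft_assignment)
```

The output is a probability vector rather than a hard label, which is what lets gradients flow through the sampling step. Lower temperatures push samples closer to one-hot vectors at the cost of noisier gradients; the index of the largest entry is distributed as a draw from the softmax of the original logits.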

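MOTIVATION AND OBJECTIVES

To address the difficulty of interpreting the high-dimensional, high-volume protein folding trajectories produced by molecular dynamics simulations.
To develop an unsupervised machine learning method that jointly performs dimensionality reduction and clustering of biomolecular conformations.
To use a Gaussian mixture distribution as the prior within the variational autoencoder framework, capturing the multi-basin nature of the protein folding free energy landscape.
To build a differentiable, end-to-end trainable model that preserves kinetic information for downstream analyses such as Markov state modeling.
To validate the model on long-timescale folding simulations and show that it reproduces known folding kinetics and structural states.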

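PROPOSED METHOD

The model is a variational autoencoder with a Gaussian mixture prior, which lets the latent space represent multimodal data distributions and thereby cluster metastable states.
The Gumbel-Softmax reparameterization makes the discrete cluster assignments differentiable, allowing end-to-end backpropagation through the stochastic sampling layer.
Convolutional neural network layers are applied to normalized Cα distance maps to extract local structural patterns in a translation-invariant manner.
The latent space is optimized by minimizing the reconstruction loss together with the KL divergence between the posterior and the prior.
After training, cluster assignments are refined with a k-nearest-neighbor method that assigns each conformation to the most probable cluster in its neighborhood.
Markov state models are built on the GMVAE embedding to compute mean first-passage times and validate kinetic accuracy.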
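The k-nearest-neighbor refinement step listed above is not specified in detail in this summary. One plausible reading is a majority vote over the nearest neighbors in latent space, sketched below. The function name `knn_refine_labels`, the Euclidean metric, the tie-breaking rule, and the toy coordinates are assumptions for illustration, not the authors' procedure.

```python
import math
from collections import Counter


def knn_refine_labels(latent_points, labels, k=5):
    """Relabel each point with the majority label among its k nearest latent neighbors."""
    if not 1 <= k < len(latent_points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    refined = []
    for i, point in enumerate(latent_points):
        # Brute-force Euclidean distances to every other point (O(N^2), fine for a sketch).
        neighbors = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(latent_points)
            if j != i
        )[:k]
        # Votes always use the original labels, so the result does not depend on visit order.
        votes = Counter(labels[j] for _, j in neighbors)
        top_label, top_count = votes.most_common(1)[0]
        # Keep the original label when it ties for the most votes.
        tied = [label for label, count in votes.items() if count == top_count]
        refined.append(labels[i] if labels[i] in tied else top_label)
    return refined


# Toy 2D latent coordinates: two well-separated groups, with index 2 mislabeled.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
noisy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
print(knn_refine_labels(points, noisy_labels, k=3))
```

For trajectories with many frames, a spatial index (for example a k-d tree) would replace the brute-force search; the voting logic stays the same.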

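EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a deep generative model such as GMVAE effectively learn a low-dimensional, interpretable representation of the protein folding free energy landscape?
RQ2: Does a Gaussian mixture prior cluster metastable states better than a standard VAE with a unimodal prior?
RQ3: Does the latent space learned by GMVAE preserve kinetic information well enough to estimate folding and unfolding timescales accurately?
RQ4: How does the model compare with established methods such as TICA in capturing the folding funnel and kinetic transitions?
RQ5: To what extent do hyperparameter choices (such as the number of clusters and the embedding dimension) affect the stability and accuracy of the kinetic predictions?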

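Key Findings

The latent space learned by GMVAE is funnel-shaped, with folded states concentrated at the bottom and unfolded states spread around the periphery, consistent with the folding funnel model.
Clusters in the latent space correspond to distinct structural states (folded, misfolded, and unfolded); the folded cluster has a low and narrow RMSD distribution, confirming its structural specificity.
For the Trp-cage protein, the model yields mean first-passage times of 2.25 µs for folding and 1.54 µs for unfolding, close to the 2.8 µs reported by the D. E. Shaw group.
With a three-dimensional latent space, both the reconstruction loss and the cross-entropy loss reach their minimum; increasing the dimension to 10 brings only marginal improvement.
The model identifies key folding transitions, such as the unfolding of helix 2 (S3 → S0), consistent with known experimental and simulation results.
Although no lag-time information is used during training, the GMVAE embedding still supports accurate kinetic modeling, and the implied timescales of the slower processes converge well; faster processes (such as those in Villin), however, require longer lag times to be estimated reliably.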
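The mean first-passage times quoted in the findings above are quantities derived from a Markov state model. As a rough sketch of the underlying arithmetic, the MFPT from each state to a target state of a row-stochastic transition matrix P estimated at lag time τ satisfies m_target = 0 and m_i = τ + Σ_{j ≠ target} P_ij m_j, which is a small linear system. The function name `mean_first_passage_times` and the three-state matrix below are hypothetical and are not taken from the paper.

```python
def mean_first_passage_times(transition_matrix, target, lag_time=1.0):
    """MFPT from every state to `target` for a row-stochastic MSM transition matrix."""
    n = len(transition_matrix)
    others = [s for s in range(n) if s != target]
    size = len(others)
    # Augmented system (I - Q) m = lag_time * 1, with Q = P restricted to non-target states.
    aug = [
        [(1.0 if i == j else 0.0) - transition_matrix[i][j] for j in others] + [lag_time]
        for i in others
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("singular system: target may be unreachable from some state")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    mfpt = [0.0] * n  # the MFPT from the target to itself is zero by definition
    for idx, state in enumerate(others):
        mfpt[state] = aug[idx][size] / aug[idx][idx]
    return mfpt


# Hypothetical 3-state model: 0 = unfolded, 1 = intermediate, 2 = folded.
P = [
    [0.90, 0.10, 0.00],
    [0.05, 0.85, 0.10],
    [0.00, 0.02, 0.98],
]
folding_mfpt = mean_first_passage_times(P, target=2, lag_time=1.0)
print(folding_mfpt)  # in units of the lag time
```

Results are returned in the same units as `lag_time`, so a matrix estimated at a lag of, say, 50 ns gives MFPTs in nanoseconds when `lag_time=50.0`. Because the estimate depends on the chosen lag, it is usually checked alongside the convergence of the implied timescales.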

This review was generated by AI and checked by human editors.