QUICK REVIEW

[Paper Digest] Learning Molecular Representation in a Cell

Gang Liu, Srijit Seal | PubMed | Jun 17, 2024
Computational Drug Discovery Methods | Cited by 7
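One-Sentence Summary

InfoAlign learns minimal sufficient molecular representations by integrating molecular structures with cellular response data in a context graph. It uses an information bottleneck with multiple decoders to align each representation with neighboring biological features, which improves molecular property prediction and zero-shot molecule-morphology matching.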

ABSTRACT

Predicting drug efficacy and safety in vivo requires information on biological responses (e.g., cell morphology and gene expression) to small molecule perturbations. However, current molecular representation learning methods do not provide a comprehensive view of cell states under these perturbations and struggle to remove noise, hindering model generalization. We introduce the Information Alignment (InfoAlign) approach to learn molecular representations through the information bottleneck method in cells. We integrate molecules and cellular response data as nodes into a context graph, connecting them with weighted edges based on chemical, biological, and computational criteria. For each molecule in a training batch, InfoAlign optimizes the encoder's latent representation with a minimality objective to discard redundant structural information. A sufficiency objective decodes the representation to align with different feature spaces from the molecule's neighborhood in the context graph. We demonstrate that the proposed sufficiency objective for alignment is tighter than existing encoder-based contrastive methods. Empirically, we validate representations from InfoAlign in two downstream applications: molecular property prediction against up to 27 baseline methods across four datasets, plus zero-shot molecule-morphology matching.

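Motivation and Goals

  • Advance holistic molecular representation learning by combining cellular responses (cell morphology and gene expression) with molecular structure.
  • Develop a context-graph-based framework that links molecules to cellular perturbations and learns a robust bottleneck representation.
  • Demonstrate the theoretical and empirical advantages of decoder-based alignment over encoder-based contrastive learning methods.
  • Evaluate InfoAlign on molecular property prediction and zero-shot molecule-morphology matching across multiple datasets.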

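Proposed Method

  • Build a cellular context graph whose nodes are molecules, cell morphology profiles, and gene expression profiles, with edges weighted by chemical, biological, and computational criteria.
  • Run random walks on the context graph to determine the neighborhood nodes of a training molecule X.
  • Train an encoder f_theta that maps X to a latent representation Z, together with multiple decoders g_phi that reconstruct the features of the neighboring nodes along the walk path.
  • Optimize a minimality objective I(X;Z) (to be minimized) and a sufficiency objective ∑_{v ∈ P_X} I(Z; ψ(v)) (to be maximized), approximated with variational bounds (I_DLB and I_EUB) and implemented as a cross-entropy loss plus KL regularization.
  • Argue that the decoder-based bound I_DLB gives a tighter mutual information bound than the encoder-based InfoNCE bound I_ELB.
  • Adapt the pretrained encoder/decoders to downstream tasks, including molecular property prediction and zero-shot molecule-morphology matching.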
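
The random-walk step in the list above can be illustrated with a minimal sketch. It assumes a toy context graph stored as an adjacency list and samples each next node in proportion to its edge weight. The node names, the weights, and the helper names `random_walk` and `neighborhood` are hypothetical and do not come from the paper, whose sampling scheme may differ.

```python
import random

# Hypothetical toy context graph: node -> list of (neighbor, edge_weight).
# Node names and weights are illustrative only.
CONTEXT_GRAPH = {
    "mol_A": [("morph_1", 0.9), ("mol_B", 0.6)],
    "mol_B": [("mol_A", 0.6), ("gene_1", 0.8)],
    "morph_1": [("mol_A", 0.9), ("gene_1", 0.4)],
    "gene_1": [("mol_B", 0.8), ("morph_1", 0.4)],
}


def random_walk(graph, start, walk_length, rng):
    """Walk up to walk_length steps, choosing neighbors in proportion to edge weight."""
    path = [start]
    current = start
    for _ in range(walk_length):
        neighbors = graph.get(current, [])
        if not neighbors:
            break  # dead end: stop the walk early
        nodes, weights = zip(*neighbors)
        current = rng.choices(nodes, weights=weights, k=1)[0]
        path.append(current)
    return path


def neighborhood(graph, start, walk_length, num_walks, seed=0):
    """Collect the set of nodes visited by several walks from start (start excluded)."""
    rng = random.Random(seed)
    visited = set()
    for _ in range(num_walks):
        visited.update(random_walk(graph, start, walk_length, rng))
    visited.discard(start)
    return visited


if __name__ == "__main__":
    print(random_walk(CONTEXT_GRAPH, "mol_A", 3, random.Random(0)))
    print(neighborhood(CONTEXT_GRAPH, "mol_A", walk_length=2, num_walks=20))
```

In this sketch, a longer `walk_length` reaches more distant nodes, so the neighborhood mixes in more molecule, morphology, and gene expression targets for the decoders.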
Figure 1: Molecular Representation Learning via the Information Bottleneck: (a) Existing contrastive learning methods utilize two encoders—one for molecules and another for cell morphology or gene expression features, lacking a holistic view of molecular representation learning in cells. (b) In cont

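Experimental Results

Research Questions

  • RQ1: Can InfoAlign make molecular representations generalize across modalities by removing redundant information while retaining sufficient biological signal?
  • RQ2: Can a context-graph-based, multi-decoder information bottleneck outperform encoder-only contrastive learning methods on molecular property prediction and zero-shot cross-modal matching?
  • RQ3: How do hyperparameters such as walk length and prior strength affect the balance between minimality and sufficiency, and downstream performance?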
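
The minimality and sufficiency trade-off that RQ3 asks about can be written compactly. The block below is a sketch in generic information-bottleneck notation, not the paper's exact formulation: the weight \beta and the prior r(z) are assumed names, and the two inequalities are the standard variational bounds of this type rather than the paper's precise definitions of I_EUB and I_DLB.

```latex
% Trade-off objective; \beta (assumed name) weights the minimality term.
\min_{\theta,\phi}\;\; \beta\, I(X;Z) \;-\; \sum_{v \in P_X} I\bigl(Z;\psi(v)\bigr)

% Generic upper bound on the minimality term, for any prior r(z):
I(X;Z) \;\le\; \mathbb{E}_{p(x)}\Bigl[\mathrm{KL}\bigl(p_\theta(z \mid x)\,\|\,r(z)\bigr)\Bigr]

% Generic decoder-based lower bound on each sufficiency term:
I\bigl(Z;\psi(v)\bigr) \;\ge\; H\bigl(\psi(v)\bigr)
  + \mathbb{E}_{p(z,\psi(v))}\bigl[\log q_\phi\bigl(\psi(v) \mid z\bigr)\bigr]
```

Under this reading, a larger \beta pulls the encoder's distribution toward the prior and discards more of X, while a longer walk adds more terms to the sum over P_X.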

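Key Findings

  • On molecular property prediction, InfoAlign outperforms up to 19 baselines across three classification datasets and one regression dataset.
  • Relative to the best baseline, InfoAlign improves by +10.58% on Broad6K classification and by +6.33% on Biogen3K regression.
  • InfoAlign shows strong zero-shot molecule-morphology matching on two molecule-morphology datasets, surpassing CLOOME and InfoCORE in several settings.
  • Decoder-based alignment yields a tighter mutual information bound than the encoder-based InfoNCE bound, which supports the theoretical advantage of the proposed method.
  • In the experiments, cell morphology and gene expression features complement molecular structure, and InfoAlign captures a cross-modal bottleneck representation that generalizes better.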
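
Zero-shot molecule-morphology matching can be viewed as a retrieval problem. The sketch below assumes that a morphology profile predicted from a molecule is compared against candidate morphology profiles by cosine similarity and that the candidates are then ranked. The scoring rule, the toy vectors, and the helper names `cosine`, `rank_candidates`, and `top_k_hit` are illustrative assumptions, and the paper's evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(predicted, candidates):
    """Return candidate ids sorted by decreasing similarity to the predicted profile."""
    scores = {cid: cosine(predicted, feat) for cid, feat in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


def top_k_hit(predicted, candidates, true_id, k):
    """True if the matching profile appears among the k highest-ranked candidates."""
    return true_id in rank_candidates(predicted, candidates)[:k]


if __name__ == "__main__":
    # Hypothetical predicted morphology features for one molecule.
    predicted = [1.0, 0.0, 0.5]
    # Hypothetical candidate morphology profiles.
    candidates = {
        "morph_a": [0.9, 0.1, 0.4],
        "morph_b": [-1.0, 0.2, 0.0],
        "morph_c": [0.0, 1.0, 0.0],
    }
    print(rank_candidates(predicted, candidates))
    print(top_k_hit(predicted, candidates, "morph_a", k=1))
```

Averaging `top_k_hit` over many molecules gives a top-k retrieval accuracy, one common way to summarize this kind of matching.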
Figure 2: Representation Learning Over Walk Paths in Context Graphs: (a) In Section 4.1, we construct the graph with various interaction, perturbation, and cosine similarities among molecules $X$, cell morphology profiles $C$, and gene expression profiles $E$. Given a training batch of molecules

This digest was generated by AI and reviewed by human editors.