QUICK REVIEW

[Paper Review] GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

Pengfei Liu, Yiming Ren | arXiv (Cornell University) | Aug 14, 2023

Computational Drug Discovery Methods | References: 47 | Cited by: 8

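One-Sentence Summary

GIT-Mol is a 700M-parameter multi-modal LLM that combines graph, image, and text inputs to improve molecule captioning, text-based molecule generation, molecular image recognition, and property prediction; its core components are the GIT-Former modality mixer and the Xmodal pre-training strategy.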

ABSTRACT

Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information with complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.

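Motivation and Objectives

Identify and address the limitation that text-only LLMs cannot make full use of molecular graphs and images.
Develop GIT-Mol, which integrates the graph, image, and text modalities into a unified latent space.
Propose GIT-Former, which uses cross-attention to fuse modalities and enable any-to-language translation.
Demonstrate improvements in molecule captioning, de novo generation, image recognition, and property prediction.
Provide ablations and analyses that verify the contribution of each modality and training strategy.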

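Proposed Method

Introduce GIT-Former, a modality mixer based on cross-modal attention that maps graphs, images, and text into a unified latent space.
Use modality-specific encoders (MolT5 for text, Swin Transformer for images, GIN for graphs) and a MolT5 decoder for generation tasks.
Pre-train with Xmodal-Text Matching (XTM) and Xmodal-Text Contrastive Learning (XTC) to align the modalities.
During fine-tuning, apply any-to-language prompts to the modality translation tasks.
Fine-tune on MoleculeNet property tasks, and use prompt tuning for language outputs.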
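
To make the contrastive alignment objective more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss, the general form commonly used for contrastive alignment of paired embeddings. This review does not give the exact XTC formulation, so treat this as an illustration of the idea rather than the authors' loss. The function names, the temperature of 0.07, and the toy 2-dimensional embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(modal_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    modal_embs[i] (e.g. a graph or image embedding) and text_embs[i] are
    assumed to describe the same molecule; every other pairing in the batch
    is treated as a negative. The temperature value is an assumption.
    """
    a = [l2_normalize(v) for v in modal_embs]
    b = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # sim[i][j]: cosine similarity of modality i and text j, scaled by temperature
    sim = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # modality -> text direction: row i should select column i
    m2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # text -> modality direction: column j should select row j
    t2m = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (m2t + t2m)


# Toy batch: matched pairs point in similar directions, shuffled pairs do not.
modal = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(modal, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(modal, [[0.1, 0.9], [0.9, 0.1]])
print(f"aligned loss = {aligned:.4f}, shuffled loss = {shuffled:.4f}")
```

The loss is low when each modality embedding is closest to its own text embedding and high when the pairs are mismatched, which is the behaviour a contrastive pre-training stage relies on to pull the modalities into one latent space.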

Figure 1: An overview of GIT-Mol. (a) Internal Information, including sequence and graph structure representations, emphasizes inherent chemical properties and simple topology; (b) External Information, e.g., images and text descriptions, provide richer details and help the human understanding; (

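Experimental Results

Research Questions

RQ1: Can GIT-Former effectively align the graph, image, and text modalities into a shared latent space for molecular tasks?
RQ2: Compared with a single modality, do multi-modal inputs improve molecule captioning, image-based recognition, and SMILES generation?
RQ3: How do the XTM and XTC training strategies affect cross-modal alignment and downstream performance?
RQ4: How does prompt learning affect any-to-language modality translation and property prediction?
RQ5: How much does GIT-Mol improve molecular property prediction accuracy and molecule generation validity?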

Key Findings

Model	BLEU-2	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L	METEOR
SciBERT	0.184	0.113	0.412	0.327	0.397	0.367
MolT5-base	0.316	0.247	0.572	0.480	0.545	0.529
GIT-Mol(SMILES)	0.264	0.176	0.477	0.374	0.451	0.430
GIT-Mol(Graph)	0.290	0.210	0.540	0.445	0.512	0.491
GIT-Mol(XTM)	0.264	0.187	0.521	0.421	0.494	0.471
GIT-Mol op	0.312	0.237	0.556	0.468	0.535	0.525
GIT-Mol	0.352	0.263	0.575	0.485	0.560	0.533
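
The captioning scores in the table can be compared directly with a few lines of arithmetic. The sketch below copies four rows from the table and computes the relative gain of one model over another, in percent. The dictionary layout and the `relative_gain` helper are illustrative choices, not something taken from the paper.

```python
METRICS = ["BLEU-2", "BLEU-4", "ROUGE-1", "ROUGE-2", "ROUGE-L", "METEOR"]

# Scores copied from the captioning table above.
SCORES = {
    "MolT5-base": [0.316, 0.247, 0.572, 0.480, 0.545, 0.529],
    "GIT-Mol(SMILES)": [0.264, 0.176, 0.477, 0.374, 0.451, 0.430],
    "GIT-Mol(Graph)": [0.290, 0.210, 0.540, 0.445, 0.512, 0.491],
    "GIT-Mol": [0.352, 0.263, 0.575, 0.485, 0.560, 0.533],
}


def relative_gain(model, baseline):
    """Per-metric relative improvement of `model` over `baseline`, in percent."""
    return {
        name: (m - b) / b * 100.0
        for name, m, b in zip(METRICS, SCORES[model], SCORES[baseline])
    }


gain_vs_molt5 = relative_gain("GIT-Mol", "MolT5-base")
gain_graph_vs_smiles = relative_gain("GIT-Mol(Graph)", "GIT-Mol(SMILES)")

for name in METRICS:
    print(f"{name}: {gain_vs_molt5[name]:+.2f}% vs MolT5-base")
```

On these numbers the full GIT-Mol model is ahead of MolT5-base on every metric, but the margin varies widely: roughly 11.4% on BLEU-2 and under 1% on ROUGE-1 and METEOR.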

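GIT-Mol outperforms the single-modality baselines on every captioning metric.
The graph-based variant generally scores higher than the SMILES variant on captioning metrics, and the multi-modal model surpasses both.
Ablations show that multi-modal input improves on single-modal input by roughly 10–15%.
For de novo generation, GIT-Mol+MolT5 achieves higher validity (0.928) along with competitive similarity metrics.
Cross-modal pre-training (XTM first, then XTC) and prompt learning both have a significant effect on the results.
On cross-modal molecule generation and property prediction tasks, GIT-Mol beats the baselines on several metrics.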

Figure 2: Study case of Molecule Caption. The GIT-Mol model exhibits precise chemical characterization, aligning closely with ground truth information.

This review was generated by AI and checked by a human editor.