QUICK REVIEW

[Paper Review] A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

Bing Su, Dazhao Du|arXiv (Cornell University)|Sep 12, 2022

Computational Drug Discovery Methods | Cited by 39

One-sentence summary

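MoMu is a molecular multimodal foundation model pretrained on paired molecular graphs and related text to bridge graph and language representations. It enables cross-modal retrieval, molecule captioning, and zero-shot text-to-graph generation, and it improves property prediction.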

ABSTRACT

Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire the single cognitive ability from the single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities including both intuitive diagrams and professional texts to assist their understanding. Inspired by this, we propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data (crawled from published Scientific Citation Index papers) via contrastive learning. This AI model represents a critical attempt that directly bridges molecular graphs and natural language. Importantly, through capturing the specific and complementary information of the two modalities, our proposed model can better grasp molecular expertise. Experimental results show that our model not only exhibits promising performance in cross-modal tasks such as cross-modal retrieval and molecule caption, but also enhances molecular property prediction and possesses capability to generate meaningful molecular graphs from natural language descriptions. We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine, among others.

Motivation and objectives

Bridge molecule graphs and natural language to enable comprehensive molecular understanding across modalities.
Pretrain a dual-encoder model on paired graph-text data to align graph and text representations.
Demonstrate downstream capabilities including cross-modal retrieval, molecule captioning, zero-shot text-to-graph generation, and property prediction.

Proposed method

Use two encoders (Graph Isomorphism Network for graphs, BERT variants for text) to map molecules into a shared representation space.
Create 15,613 graph-document pairs by linking PubChem molecule graphs with related SCI-paper text retrieved from S2ORC.
Apply two graph augmentations and four cross-modal contrastive losses in a MoMu multi-view training setup inspired by GraphCL.
Initialize the graph encoder with GraphCL-pretrained GIN weights and the text encoder with Sci-BERT or KV-PLM to bootstrap training.
Train with inter- and intra-modal contrastive learning using InfoNCE loss to align graph and text representations.
Evaluate cross-modal retrieval (graph-to-text and text-to-graph) on PCdes and perform zero-shot retrieval tests; assess text-to-graph alignment and generation capabilities.
Demonstrate improved molecule captioning by incorporating MoMu graph features into MolT5-based captioning.
Propose zero-shot text-to-graph molecule generation by optimizing a latent vector in MoFlow’s generator conditioned on cross-modal similarity with MoMu representations.
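
The contrastive objective in the list above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The function names (`cosine`, `info_nce`, `momu_style_loss`), the temperature value, the equal weighting of terms, and the reading of "four cross-modal losses" as two augmented graph views times two matching directions are all assumptions. Embeddings are plain lists of floats standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair,
    and every other key in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def momu_style_loss(graph_view1, graph_view2, text, temperature=0.1):
    """Hypothetical combined objective (assumed equal weights)."""
    # Four cross-modal terms: each augmented graph view against the text,
    # in both the graph-to-text and text-to-graph directions.
    cross = (
        info_nce(graph_view1, text, temperature)
        + info_nce(text, graph_view1, temperature)
        + info_nce(graph_view2, text, temperature)
        + info_nce(text, graph_view2, temperature)
    )
    # Intra-modal terms between the two augmented graph views.
    intra = (
        info_nce(graph_view1, graph_view2, temperature)
        + info_nce(graph_view2, graph_view1, temperature)
    )
    return cross + intra
```

When graph and text embeddings at the same batch index point in similar directions, the loss is small; shuffling the text batch against the graphs makes it grow, which is the signal that pulls matched pairs together during training.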

Experimental results

Research questions

RQ1: Can a joint graph-text encoder align molecular graphs with natural language descriptions in a shared embedding space?
RQ2: Can the MoMu representations support cross-modal retrieval and captioning, and enable zero-shot text-to-graph molecule generation?
RQ3: Does multimodal pretraining improve molecular property prediction compared to single-modality pretraining?
RQ4: Is zero-shot generation able to produce diverse molecules matching high-level textual descriptions?

Key findings

MoMu outperforms baselines on graph-to-text and text-to-graph retrieval, including zero-shot scenarios.
MoMu-based graph features improve MolT5 captioning metrics on the ChEBI-20 dataset.
On average across MoleculeNet datasets, MoMu pretraining yields better molecular property prediction than the compared baselines.
Zero-shot text-to-graph generation can produce diverse molecules that meet described conditions, leveraging MoMu and MoFlow.
Graph encoders initialized with multimodal pretraining outperform single-modality initialization in downstream tasks.
MoMu representations show clearer separation of properties in t-SNE visualizations after fine-tuning.
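
The retrieval findings above depend on ranking candidates from one modality against a query from the other. The sketch below shows one common way to score such a ranking, recall@k over cosine similarity. The summary does not state which metric the paper reports, so the choice of recall@k, the function names, and the convention that query `i` matches candidate `i` are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(query_embs, candidate_embs, k):
    """Fraction of queries whose matching candidate (same index)
    appears among the k most similar candidates."""
    hits = 0
    for i, q in enumerate(query_embs):
        scores = [cosine(q, c) for c in candidate_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)
```

Passing graph embeddings as queries and text embeddings as candidates gives a graph-to-text score; swapping the arguments gives the text-to-graph direction.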

This review was generated by AI and checked by a human editor.