QUICK REVIEW

[Paper Review] From Words to Molecules: A Survey of Large Language Models in Chemistry

Liao Chang, Yemin Yu|arXiv (Cornell University)|Feb 2, 2024

Machine Learning in Materials Science | Citations: 6

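One-Sentence Summary

This survey categorizes how large language models (LLMs) are applied in chemistry, detailing molecular representations, tokenization, pretraining objectives, and application paradigms. It also outlines future research directions.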

ABSTRACT

In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and various interdisciplinary areas. However, applying LLMs to chemistry is a complex task that requires specialized domain knowledge. This paper provides a thorough exploration of the nuanced methodologies employed in integrating LLMs into the field of chemistry, delving into the complexities and innovations at this interdisciplinary juncture. Specifically, our analysis begins with examining how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three distinct groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs for LLMs. Furthermore, this paper delves into the pretraining objectives with adaptations to chemical LLMs. After that, we explore the diverse applications of LLMs in chemistry, including novel paradigms for their application in chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advancements in continual learning, and improvements in model interpretability, paving the way for groundbreaking developments in the field.

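Motivation and Objectives

Systematically review how molecular information is tokenized and represented for LLMs.
Provide a taxonomy of chemical LLMs (single-domain, multi-domain, multi-modal) based on input domain and modality.
Analyze pretraining objectives and adaptation strategies for chemical data.
Explore the diverse chemistry applications enabled by LLMs and identify open research directions.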

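Proposed Method

Classifies molecular representations into fingerprints, SMILES/SELFIES, InChI, and graph-based representations, and organizes tokenization into character-level, atom-level, and motif-level schemes.
Presents a taxonomy of pretraining data domains (single-domain, multi-domain, multi-modal) together with strategies for integrating them.
Reviews three core pretraining objectives for chemical LLMs, each adapted with chemistry-specific tasks: Masked Language Modeling (MLM), Molecule Property Prediction (MPP), and Autoregressive Token Generation (ATG).
Covers cross-modal objectives such as cross-modal contrastive learning (XMC) and alignment across modalities.
Summarizes representative architectures, datasets, and training approaches in a comprehensive table of methods.
Outlines applications and future directions, including continual learning and interpretability.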
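The difference between character-level and atom-level tokenization can be illustrated on a SMILES string. The following is a minimal sketch, not the survey's own method: the regular expression is a simplified pattern assumed here for illustration (it does not cover the full SMILES grammar), and the function names are hypothetical. Motif-level tokenization is not shown.

```python
import re

# Simplified atom-level pattern (an assumption for illustration only):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, aromatic atoms, ring-closure digits, then bonds and branch symbols.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|\d|[()=#\-+\\/:.~@*])"
)


def char_level_tokens(smiles: str) -> list[str]:
    """Split a SMILES string into single characters."""
    return list(smiles)


def atom_level_tokens(smiles: str) -> list[str]:
    """Split a SMILES string so that each atom is a single token."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Reject inputs that the simplified pattern cannot fully cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(char_level_tokens("CC(Cl)Br"))  # ['C', 'C', '(', 'C', 'l', ')', 'B', 'r']
print(atom_level_tokens("CC(Cl)Br"))  # ['C', 'C', '(', 'Cl', ')', 'Br']
```

Character-level splitting breaks `Cl` and `Br` into two tokens each, so the model has to relearn that they are single atoms. The atom-level variant keeps them whole at the cost of a larger vocabulary.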

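Experimental Results

Research Questions

RQ1: How are molecular sequences tokenized and represented for LLMs in chemistry?
RQ2: Which taxonomy best characterizes chemical LLMs by input domain and modality?
RQ3: Which pretraining objectives are used, and how are they adapted to chemical data?
RQ4: What are the main applications and paradigms enabled by chemical LLMs?
RQ5: What are the future directions for integrating chemical knowledge, continual learning, and improving interpretability?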

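Key Findings

Molecular representations include fingerprints, SMILES/SELFIES, InChI, and graph-based formats, and they vary in granularity.
Tokenization schemes include character-level, atom-level, and motif-level approaches, using both data-driven and chemistry-driven methods.
Chemical LLMs are organized into single-domain, multi-domain, and multi-modal categories according to their input data and modality.
MLM, MPP, and ATG are the core pretraining objectives; MPP provides a strong representation-learning signal, and ATG enables task alignment.
Cross-modal learning and representation alignment are used to fuse text, graphs, fingerprints, and images, though domain-specific nuances remain a challenge.
Applications include chatbots, in-context learning, and representation learning for downstream tasks such as property prediction, reaction prediction, and molecule generation.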
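To make the MLM objective concrete, the sketch below shows how training inputs and labels could be built from an atom-level token sequence. It is a simplified illustration under stated assumptions, not the procedure of any specific model in the survey: the `[MASK]` symbol, the 15% default rate, and the function name `mask_tokens` are all hypothetical choices, and the random and keep-original replacement variants used by some BERT-style setups are omitted.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; real vocabularies may differ


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for a masked-language-modeling step.

    labels[i] holds the original token at masked positions and None
    elsewhere, so a loss would only be computed on the masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)  # seeded for reproducibility
    # Always mask at least one position so every sequence yields a signal.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Atom-level tokens for the SMILES string "CC(Cl)Br".
tokens = ["C", "C", "(", "Cl", ")", "Br"]
masked, labels = mask_tokens(tokens, mask_rate=0.5, seed=1)
print(masked)
print(labels)
```

The model would then be trained to predict each non-`None` label from the masked sequence, which pushes it to learn which atoms and bonds are plausible in a given context.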

This review was generated by AI and checked by a human editor.