QUICK REVIEW

[Paper Review] Unifying Molecular and Textual Representations via Multi-task Language Modelling

Dimitrios Christofidellis, Giorgio Giannone | arXiv (Cornell University) | Jan 29, 2023

Machine Learning in Materials Science | Citations: 25

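One-line summary

This paper proposes Text+Chem T5, a multi-domain, multi-task transformer that bridges natural language and chemical language. It performs cross-domain tasks without task-specific fine-tuning or expensive single-domain pre-training, and shows performance gains from encoder sharing and from scaling.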

ABSTRACT

The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increase with scale, as measured by more than a dozen of relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.

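Motivation and Objectives

Bridge natural language and chemical representations with a single multi-task model.
Eliminate the need for expensive mono-domain pre-training and task-specific fine-tuning.
Demonstrate cross-domain and cross-task capability across chemistry and NLP benchmarks.
Analyze encoder-sharing and aggregation strategies for maximizing cross-domain transfer.
Show the scaling benefits of larger model sizes on cross-domain tasks.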

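Proposed Method

Uses a T5-based encoder-decoder architecture as the backbone.
Jointly trains a single model on both mono-domain (text or chemistry) and cross-domain tasks, using task prompts to tell the tasks apart.
Shares the encoder across domains, and also examines aggregation strategies that combine domain-specific encoders through cross-attention.
Evaluates multiple encoder strategies (shared vs. domain-specific, frozen vs. fine-tuned) across mol2mol, text2mol, mol2text, and text2text tasks.
Compares against baselines including a Transformer, fine-tuned T5 models, RXN-family models, and MolT5.
Uses augmented data variants to balance the task distribution, and evaluates the effect of data scale.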
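
The prompt-based joint training described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt strings, the helper names `make_example` and `mix_tasks`, and the toy data are all assumptions made for this sketch. It only shows the general idea that a task prefix lets one sequence-to-sequence model consume all four task families from a single mixed stream.

```python
import random

# Hypothetical prompt prefixes, one per task family named in the review.
# The exact prompt wording used by the authors is not given here.
TASK_PROMPTS = {
    "mol2mol": "Predict the product of the following reaction: ",
    "text2mol": "Write in SMILES the described molecule: ",
    "mol2text": "Caption the following SMILES: ",
    "text2text": "Generate a title for the given paragraph: ",
}


def make_example(task, source, target):
    """Prefix the source with its task prompt so one model can serve all tasks."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task}")
    return {"task": task, "input": TASK_PROMPTS[task] + source, "target": target}


def mix_tasks(datasets, seed=0):
    """Flatten per-task (source, target) pairs into one shuffled training stream."""
    rng = random.Random(seed)
    mixed = [
        make_example(task, src, tgt)
        for task, pairs in datasets.items()
        for src, tgt in pairs
    ]
    rng.shuffle(mixed)
    return mixed


# Toy data for illustration only.
toy_datasets = {
    "mol2text": [("CCO", "Ethanol is a primary alcohol.")],
    "text2mol": [("A primary alcohol with two carbon atoms.", "CCO")],
    "mol2mol": [("CC(=O)O.OCC", "CC(=O)OCC")],
    "text2text": [("We study a multi-task language model.", "A multi-task model")],
}

mixed_stream = mix_tasks(toy_datasets)
for example in mixed_stream:
    print(example["task"], "|", example["input"], "->", example["target"])
```

In a real setup, each `input`/`target` pair would be tokenized and fed to a T5-style encoder-decoder. Balancing the task distribution, which the review attributes to the augmented data variants, would change how many pairs each task contributes to the stream.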

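Experimental Results

Research Questions

RQ1: Can a single multi-task, multi-domain model achieve competitive performance on both chemistry and NLP tasks, and outperform mono-domain baselines on cross-domain tasks?
RQ2: Do encoder sharing and cross-domain information sharing improve cross-domain translation performance?
RQ3: Can the model perform cross-domain tasks without task-specific heads or large-scale mono-domain pre-training?
RQ4: How does model size (small vs. base) affect cross-domain performance and scalability?
RQ5: How do aggregation strategies affect performance on cross-domain tasks?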

Key Findings

Model	Agg	Enc-sharing	Enc-tuning	text2mol	mol2text
MD e^2-CLM	mean	✗	✗	0.572	0.123
MD e^2-CLM	cross-att	✗	✗	0.702	0.274
MDMT e^2-CLM	cross-att	✗	✗	0.247	0.119
MDMT e^2-CLM	cross-att	✗	✓	0.211	0.075
Text+Chem T5	-	✓	✓	0.750	0.580
Text+Chem T5-augm	-	✓	✓	0.853	0.625
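
The Agg column distinguishes "mean" from "cross-att" aggregation for models that keep separate domain encoders. The sketch below illustrates the general difference between the two ideas in plain Python. It is an assumption-laden toy: it uses unparameterized scaled dot-product attention with no learned projections, and the function names and inputs are hypothetical, so it should not be read as the architecture evaluated in the paper.

```python
import math


def mean_pool(states):
    """Average a list of d-dimensional vectors into a single vector."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def mean_aggregate(text_states, chem_states):
    """Pool each encoder's output, then average the two pooled vectors."""
    pooled_text = mean_pool(text_states)
    pooled_chem = mean_pool(chem_states)
    return [(a + b) / 2 for a, b in zip(pooled_text, pooled_chem)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections.

    Each query vector is replaced by a weighted mix of the key/value vectors,
    so every output position can draw on the other encoder's sequence.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs


# Toy encoder outputs (2 text positions, 3 chemistry positions, dimension 2).
text_states = [[1.0, 0.0], [0.0, 1.0]]
chem_states = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

print("mean aggregation:", mean_aggregate(text_states, chem_states))
print("cross-attention :", cross_attend(text_states, chem_states))
```

Mean aggregation collapses both sequences into one fixed vector, while cross-attention keeps one output per query position and lets each position weight the other domain's tokens differently.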

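Text+Chem T5 outperforms the baselines on cross-domain tasks across multiple metrics (e.g., BLEU, ROUGE, METEOR).
Encoder sharing combined with fine-tuning yields clear gains on cross-domain tasks: Text+Chem T5 outperforms the MD e^2-CLM variants on both text2mol and mol2text.
At both the small and base sizes, Text+Chem T5 achieves the highest scores on mol2text (SMILES to caption) and text2mol (text to SMILES).
Increasing model size yields larger gains for Text+Chem T5 than for the baselines, indicating that scaling helps more when information is shared across domains.
Ablations show that encoder sharing and fine-tuning have the largest effect on cross-domain performance, while the aggregation method matters less.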
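
To make the size of these gaps concrete, the following sketch recomputes the differences directly from the table above. The scores are copied from the table. The row labels, the grouping into separate-encoder and shared-encoder rows, and the helper names are shorthand introduced here for illustration.

```python
# Scores copied from the table above as (text2mol, mol2text).
# Row labels are shorthand introduced for this sketch.
separate_encoders = {
    "MD mean": (0.572, 0.123),
    "MD cross-att": (0.702, 0.274),
    "MDMT cross-att, encoder not tuned": (0.247, 0.119),
    "MDMT cross-att, encoder tuned": (0.211, 0.075),
}
shared_encoder = {
    "Text+Chem T5": (0.750, 0.580),
    "Text+Chem T5-augm": (0.853, 0.625),
}


def best_per_task(rows):
    """Best score in each column across a set of rows."""
    return tuple(max(column) for column in zip(*rows.values()))


def gains(rows, reference):
    """Absolute gain of each row over the reference scores, per task."""
    return {
        name: tuple(round(score - ref, 3) for score, ref in zip(scores, reference))
        for name, scores in rows.items()
    }


best_separate = best_per_task(separate_encoders)
shared_gains = gains(shared_encoder, best_separate)

for name, (gain_text2mol, gain_mol2text) in shared_gains.items():
    print(f"{name}: text2mol {gain_text2mol:+.3f}, mol2text {gain_mol2text:+.3f}")
```

Against the best separate-encoder row in each column (0.702 and 0.274), Text+Chem T5 gains 0.048 on text2mol and 0.306 on mol2text, so the shared-encoder advantage in this table is much larger on mol2text.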


This review was generated by AI and checked by a human editor.