QUICK REVIEW

[Paper Review] Unifying Molecular and Textual Representations via Multi-task Language Modelling

Dimitrios Christofidellis, Giorgio Giannone|arXiv (Cornell University)|2023. 01. 29.

Machine Learning in Materials Science | Citations: 25

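One-Line Summary

This paper introduces Text+Chem T5, a multi-domain, multi-task transformer that bridges natural language and chemical language, handles cross-domain tasks without task-specific fine-tuning or expensive single-domain pre-training, and improves with encoder sharing and model scale.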

ABSTRACT

The recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains. Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increase with scale, as measured by more than a dozen of relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.

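Motivation and Objectives

Bridge natural language and chemical representations in a single multi-task model.
Remove the need for expensive single-domain pre-training and per-task fine-tuning.
Demonstrate cross-domain and cross-task capability on chemistry and NLP benchmarks.
Analyze encoder-sharing and aggregation strategies that maximize cross-domain transfer.
Show how larger model sizes benefit cross-domain tasks.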

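Proposed Method

Uses a T5-based encoder-decoder architecture as the backbone.
Trains a single model jointly on single-domain (text or chemistry) and cross-domain tasks, using task prompts to tell the tasks apart.
Shares the encoder across domains, and also explores a merging strategy that combines domain-specific encoders through cross-attention.
Evaluates several encoder strategies on mol2mol, text2mol, mol2text, and text2text tasks: shared vs. domain-specific encoders, and frozen vs. fine-tuned encoders.
Compares against baselines including a Transformer, fine-tuned T5 models, RXN-family models, and MolT5.
Uses augmented variants of the data to balance the task distribution and to assess the effect of data scale.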
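
The prompt-based joint training and task balancing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt strings in `TASK_PROMPTS`, the helper names `format_example` and `balanced_mixture`, and the toy examples in `RAW` are all assumptions made for this sketch, since the review does not give the exact prompts or the augmentation procedure.

```python
import random
from collections import defaultdict

# Hypothetical prompts: the exact wording used in the paper is not given here.
TASK_PROMPTS = {
    "mol2mol": "Predict the product of the following reaction: ",
    "text2mol": "Write in SMILES the described molecule: ",
    "mol2text": "Caption the following SMILES: ",
    "text2text": "Write the reaction steps for: ",
}


def format_example(task, source, target):
    """Prefix the source with its task prompt so one seq2seq model serves all tasks."""
    return {"task": task, "input": TASK_PROMPTS[task] + source, "target": target}


def balanced_mixture(examples, per_task, seed=0):
    """Resample every task to `per_task` examples (with replacement if a task is small)."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in examples:
        by_task[ex["task"]].append(ex)
    mixture = []
    for task in sorted(by_task):
        pool = by_task[task]
        if len(pool) >= per_task:
            mixture.extend(rng.sample(pool, per_task))   # downsample large tasks
        else:
            mixture.extend(rng.choices(pool, k=per_task))  # upsample small tasks
    rng.shuffle(mixture)
    return mixture


# Toy, illustrative data only (not taken from the paper's datasets).
RAW = [
    ("mol2mol", "CCO.CC(=O)O", "CC(=O)OCC"),
    ("mol2mol", "CO.CC(=O)O", "CC(=O)OC"),
    ("mol2mol", "CCO.OC(=O)CC", "CCC(=O)OCC"),
    ("text2mol", "The molecule is ethanol.", "CCO"),
    ("mol2text", "CCO", "The molecule is ethanol."),
    ("text2text", "Esterification of ethanol with acetic acid.",
     "Mix the reagents, add acid catalyst, heat under reflux."),
]

examples = [format_example(*row) for row in RAW]
mixture = balanced_mixture(examples, per_task=2)
for ex in mixture[:3]:
    print(ex["input"], "->", ex["target"])
```

Because every example is plain text in and plain text out, the same encoder-decoder can be trained on the whole mixture with a standard sequence-to-sequence loss. Equal per-task sampling is only one possible balancing choice; the paper's own scheme may differ.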

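Experimental Results

Research Questions

RQ1: Can a single multi-task, multi-domain model achieve competitive performance on chemistry and NLP tasks, and outperform single-domain baselines on cross-domain tasks?
RQ2: Do encoder sharing and cross-domain information sharing improve cross-domain translation performance?
RQ3: Can the model perform cross-domain tasks without task-specific heads or extensive single-domain pre-training?
RQ4: How does model size (small vs. base) affect cross-domain performance and scalability?
RQ5: How does the aggregation strategy affect cross-domain task performance?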

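Key Results

Text+Chem T5 outperforms the baselines on cross-domain tasks across multiple metrics (BLEU, ROUGE, METEOR, and others).
Sharing and fine-tuning the encoder substantially improves cross-domain performance, outperforming the MD e^2-CLM variants on text2mol and mol2text.
Text+Chem T5 achieves the top scores on mol2text (caption from SMILES) and text2mol (SMILES from text) at both the small and base scales.
Increasing model size yields faster and larger gains for Text+Chem T5 than for the baselines, suggesting that sharing information across domains improves scalability.
Ablation studies show that encoder sharing and fine-tuning have the largest effect on cross-domain performance, while the aggregation method matters less.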
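
For readers unfamiliar with the overlap-based metrics listed above, the following sketch shows the clipped n-gram precision that underlies BLEU, plus a plain exact-match score for generated SMILES. It is a simplified illustration, not the paper's evaluation code: it omits BLEU's brevity penalty and the geometric mean over n-gram orders, and the function names and example strings are assumptions made for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def exact_match(predictions, references):
    """Share of predictions that are string-identical to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative mol2text example (captions are made up for this sketch).
reference = "the molecule is a primary alcohol".split()
candidate = "the molecule is an alcohol".split()
p1 = clipped_precision(candidate, reference, 1)  # 4 of 5 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 2 of 4 bigrams match
print(p1, p2)

# Illustrative text2mol example: one of two SMILES strings matches exactly.
em = exact_match(["CCO", "CC(=O)O"], ["CCO", "CC(=O)OC"])
print(em)
```

String-level exact match is strict for molecules, because one structure can be written as several different SMILES strings. A real evaluation would typically canonicalize both sides with a cheminformatics toolkit before comparing.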


This review was generated by AI and reviewed by a human editor.