QUICK REVIEW

[Paper Review] Tx-LLM: A Large Language Model for Therapeutics

Juan Manuel Zambrano Chaves, E.-W. Wang | arXiv (Cornell University) | 2024. 06. 10.

Natural Language Processing Techniques | Citations: 11

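One-Line Summary

Tx-LLM is a generalist LLM fine-tuned from PaLM-2 that encodes knowledge across diverse therapeutic modalities and tasks; with a single model, it achieves performance competitive with or near the state of the art on many drug discovery benchmarks.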

ABSTRACT

Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free-text, allowing it to predict a broad range of associated properties, achieving competitive with state-of-the-art (SOTA) performance on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery development pipeline.

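Motivation and Objectives

Advance the prospect of a single generalist AI that supports multiple stages of the therapeutic development pipeline.
Train an LLM on diverse therapeutics data to improve cross-task performance without task-specific fine-tuning.
Demonstrate competitive or superior performance across the Therapeutics Data Commons (TDC) task suite.
Investigate positive transfer between datasets of different drug types, along with the effects of model size, fine-tuning, and prompting.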

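Proposed Method

Fine-tune the PaLM-2 base model on TxT (Therapeutics instruction Tuning), a collection of 709 datasets covering 66 tasks drawn from the Therapeutics Data Commons (TDC).
Represent therapeutics as strings (SMILES, sequences, text) and interleave them with free text in prompts for classification, regression, and generation tasks.
Mix zero-shot and few-shot prompts during training (70% zero-shot, 30% few-shot), selecting the shots at random.
Train a single model across all datasets with mixture ratios proportional to dataset size; explore S and M model variants.
Evaluate with task-appropriate metrics (AUROC, AUPRC, accuracy, Spearman/Pearson correlation, MAE, MSE, USPTO generation accuracy).
Run ablations over model size, domain fine-tuning, prompting strategy, and the presence of context to assess their effect on performance.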
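
The prompt-mixing and dataset-sampling steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the prompt template, the helper names (`sample_dataset`, `build_prompt`), the maximum shot count, and the toy datasets are all assumptions made for the example. Only the 70%/30% zero-shot/few-shot split and the size-proportional mixture come from the summary.

```python
import random

# Hypothetical toy datasets: name -> list of (input string, answer) pairs.
# Real TxT entries are derived from TDC; these rows are made up for illustration.
DATASETS = {
    "toy_toxicity": [("CCO", "No"), ("c1ccccc1", "Yes"), ("CC(=O)O", "No")],
    "toy_solubility": [("CCN", "High"), ("CCCCCCCC", "Low")],
}

ZERO_SHOT_PROB = 0.7  # 70% zero-shot, 30% few-shot, as stated in the summary
MAX_SHOTS = 2         # assumption: the summary does not state the shot range


def sample_dataset(datasets, rng):
    """Pick a dataset name with probability proportional to its size."""
    names = list(datasets)
    weights = [len(datasets[n]) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


def build_prompt(instruction, examples, query, rng):
    """Build a zero-shot or few-shot prompt; return (prompt, number of shots)."""
    shots = []
    if rng.random() >= ZERO_SHOT_PROB:
        # Few-shot case: draw random shots, excluding the query itself.
        pool = [ex for ex in examples if ex[0] != query]
        shots = rng.sample(pool, k=min(MAX_SHOTS, len(pool)))
    lines = [instruction]
    for x, y in shots:
        lines.append(f"Input: {x}\nAnswer: {y}")
    lines.append(f"Input: {query}\nAnswer:")
    return "\n".join(lines), len(shots)


rng = random.Random(0)
name = sample_dataset(DATASETS, rng)
query, _ = DATASETS[name][0]
prompt, n_shots = build_prompt(f"Task: {name}.", DATASETS[name], query, rng)
print(prompt)
```

Drawing the dataset first and the prompt style second keeps the two sources of randomness independent, so the zero-shot fraction stays near 70% regardless of which dataset is picked.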

Figure 1 : Overview of the Tx-LLM. (top) Datasets from the Therapeutic Data Commons are used to construct the Therapeutics instruction Tuning (TxT) collection. The original tabular datasets contain a variety of drug types including small molecules, macro-molecules such as proteins and nucleic acids,

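Experimental Results

Research Questions

RQ1: Can a single generalist LLM learn to perform diverse therapeutic tasks spanning molecules, proteins, nucleic acids, cells, and diseases?
RQ2: Do domain fine-tuning and larger model size improve performance across Therapeutics Data Commons tasks?
RQ3: Is there positive transfer between datasets of different drug types, and how do prompting strategies affect results?
RQ4: Do prompts that supply context improve performance across a broad range of therapeutic tasks?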

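Key Results

Tx-LLM performed near or above SOTA on 43 of 66 tasks and exceeded SOTA on 22 tasks.
On datasets that combine SMILES with text (e.g., disease or cell line names), Tx-LLM tends to exceed SOTA on average, likely owing to context learned during pretraining.
Evidence of positive transfer: training on datasets of diverse drug types improves performance on small-molecule datasets.
Model scale and domain fine-tuning substantially improve performance; larger and fine-tuned variants outperform the baselines on many tasks.
Contextualized prompts substantially improve performance; removing the context lowers accuracy on most datasets.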
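
To make the context ablation concrete, the sketch below builds the same query with and without a context sentence. The template wording and the `make_prompt` helper are hypothetical and not taken from the paper, and the inputs are illustrative strings rather than rows from any TDC dataset.

```python
def make_prompt(smiles, cell_line, context=None):
    """Assemble a prompt string; `context` is optional free text."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Drug SMILES: {smiles}")
    parts.append(f"Cell line: {cell_line}")
    parts.append("Question: Is the drug active against this cell line?")
    parts.append("Answer:")
    return "\n".join(parts)


# Illustrative inputs only (assumptions for this example).
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
cell_line = "MCF7"
context = "Drug response depends on both the compound and the cell line."

with_context = make_prompt(smiles, cell_line, context)
without_context = make_prompt(smiles, cell_line)  # ablated variant

print(with_context)
print("---")
print(without_context)
```

In an ablation of this kind, the two variants would be scored on the same held-out examples so that any performance gap can be attributed to the context line alone.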

Figure 2 : Tx-LLM may be effective for end-to-end therapeutic development. Tx-LLM is a single model that can be queried for multiple steps of the therapeutic development process, covering tasks from early-stage target discovery to late-stage clinical trial approval. We list example tasks associated


This review was generated by AI and reviewed by a human editor.