QUICK REVIEW

[Paper Review] Tx-LLM: A Large Language Model for Therapeutics

Juan Manuel Zambrano Chaves, E.-W. Wang|arXiv (Cornell University)|Jun 10, 2024

Natural Language Processing Techniques | Cited by 11

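One-line summary

Tx-LLM is a generalist LLM fine-tuned from PaLM-2 that encodes knowledge across diverse therapeutic modalities and tasks and, with a single model, is competitive with the state of the art on many drug discovery benchmarks (43 of 66 tasks, exceeding it on 22).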

ABSTRACT

Developing therapeutics is a lengthy and expensive process that requires the satisfaction of many different criteria, and AI models capable of expediting the process would be invaluable. However, the majority of current AI approaches address only a narrowly defined set of tasks, often circumscribed within a particular domain. To bridge this gap, we introduce Tx-LLM, a generalist large language model (LLM) fine-tuned from PaLM-2 which encodes knowledge about diverse therapeutic modalities. Tx-LLM is trained using a collection of 709 datasets that target 66 tasks spanning various stages of the drug discovery pipeline. Using a single set of weights, Tx-LLM simultaneously processes a wide variety of chemical or biological entities (small molecules, proteins, nucleic acids, cell lines, diseases) interleaved with free-text, allowing it to predict a broad range of associated properties, achieving competitive with state-of-the-art (SOTA) performance on 43 out of 66 tasks and exceeding SOTA on 22. Among these, Tx-LLM is particularly powerful and exceeds best-in-class performance on average for tasks combining molecular SMILES representations with text such as cell line names or disease names, likely due to context learned during pretraining. We observe evidence of positive transfer between tasks with diverse drug types (e.g., tasks involving small molecules and tasks involving proteins), and we study the impact of model size, domain finetuning, and prompting strategies on performance. We believe Tx-LLM represents an important step towards LLMs encoding biochemical knowledge and could have a future role as an end-to-end tool across the drug discovery development pipeline.

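Motivation and objectives

Motivate a single generalist AI model that supports multiple stages of the therapeutic development pipeline.
Train an LLM on diverse therapeutic data, without task-specific fine-tuning, to improve cross-task performance.
Demonstrate performance that is competitive with or exceeds the state of the art across a broad set of Therapeutics Data Commons (TDC) tasks.
Investigate positive transfer between datasets of different drug types, and the effects of model size, fine-tuning, and prompting.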

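Proposed method

Fine-tune a PaLM-2 base model on TxT, a collection of 709 datasets covering 66 tasks drawn from the Therapeutics Data Commons (TDC).
Represent therapeutic data as strings (SMILES, sequences, text) and combine them with free text in prompts for classification, regression, and generation tasks.
Mix zero-shot and few-shot prompts during training (70% zero-shot, 30% few-shot, with shots selected at random).
Train a single model across all datasets, with mixture ratios proportional to dataset size. S and M model variants are considered.
Evaluate with task-appropriate metrics (AUROC, AUPRC, accuracy, Spearman/Pearson correlation, MAE, MSE, and generation accuracy on USPTO).
Run ablations over model size, domain fine-tuning, prompting strategy, and the presence of context to assess their effect on performance.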
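
The data-mixing steps above (size-proportional dataset sampling and the 70/30 zero-shot/few-shot split) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy datasets, the prompt template, the single shared instruction, and the `MAX_SHOTS` cap are all assumptions made for illustration, and only the 70%/30% ratio and the proportional mixing come from the review.

```python
import random

# Hypothetical toy datasets: name -> list of (input string, label) pairs.
DATASETS = {
    "toy_toxicity": [("CCO", "No"), ("c1ccccc1", "Yes"), ("CC(=O)O", "No"), ("CCN", "No")],
    "toy_binding": [("MKTAYIAK", "Yes"), ("GSHMLEDP", "No")],
}

ZERO_SHOT_FRACTION = 0.7  # from the review: 70% zero-shot, 30% few-shot
MAX_SHOTS = 3  # assumption for this sketch; the review gives no upper bound


def mixture_weights(datasets):
    """Sampling probability of each dataset, proportional to its size."""
    total = sum(len(rows) for rows in datasets.values())
    return {name: len(rows) / total for name, rows in datasets.items()}


def format_example(x, y=None):
    """Render one example as text; the answer is left blank for the query."""
    answer = "" if y is None else f" {y}"
    return f"Input: {x}\nAnswer:{answer}"


def build_prompt(datasets, instruction, rng):
    """Return (prompt, target, number_of_shots) for one training example."""
    weights = mixture_weights(datasets)
    names = list(weights)
    # Pick a dataset with probability proportional to its size.
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    rows = datasets[name]
    query_idx = rng.randrange(len(rows))
    query_x, target = rows[query_idx]

    shots = []
    pool = [row for i, row in enumerate(rows) if i != query_idx]
    # With probability 0.3, prepend randomly chosen solved examples (few-shot).
    if pool and rng.random() >= ZERO_SHOT_FRACTION:
        k = rng.randint(1, min(MAX_SHOTS, len(pool)))
        shots = rng.sample(pool, k)

    parts = [instruction]
    parts += [format_example(x, y) for x, y in shots]
    parts.append(format_example(query_x))
    return "\n\n".join(parts), target, len(shots)


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, target, n_shots = build_prompt(
        DATASETS, "Answer Yes or No for the given input.", rng
    )
    print(prompt)
    print("target:", target, "| shots:", n_shots)
```

In a real setup each task would carry its own instruction and context text rather than one shared string; the sketch keeps a single instruction only to stay short.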

Figure 1: Overview of the Tx-LLM. (top) Datasets from the Therapeutics Data Commons are used to construct the Therapeutics Instruction Tuning (TxT) collection. The original tabular datasets contain a variety of drug types including small molecules, macro-molecules such as proteins and nucleic acids,

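Experimental results

Research questions

RQ1: Can a single generalist LLM learn and perform diverse therapeutic tasks spanning molecules, proteins, nucleic acids, cell lines, and diseases?
RQ2: Do domain fine-tuning and larger model size improve performance across Therapeutics Data Commons tasks?
RQ3: Is there positive transfer between datasets of different drug types, and how do prompting strategies affect results?
RQ4: Does providing contextual information in the prompt improve performance across a broad range of therapeutic tasks?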

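Key findings

Tx-LLM achieves performance near or above the state of the art on 43 of 66 tasks and exceeds SOTA on 22.
On datasets that combine SMILES with text (e.g., disease names or cell line names), it tends to exceed SOTA on average, likely owing to context learned during pretraining.
Evidence of positive transfer: training on datasets with diverse drug types improves performance on small-molecule datasets.
Model scale and domain fine-tuning substantially improve performance: larger and fine-tuned models outperform baselines on many tasks.
Prompts with context substantially improve performance: removing the context lowers accuracy on most datasets.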

Figure 2: Tx-LLM may be effective for end-to-end therapeutic development. Tx-LLM is a single model that can be queried for multiple steps of the therapeutic development process, covering tasks from early-stage target discovery to late-stage clinical trial approval. We list example tasks associated

This review was generated by AI and checked by a human editor.