QUICK REVIEW

[Paper Review] Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences

Niklas Schmidinger, Lisa Schneckenreiter|arXiv (Cornell University)|Nov 6, 2024

Machine Learning in Bioinformatics | Citations: 5

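One-line summary

Bio-xLSTM tailors the xLSTM architecture to DNA, proteins, and SMILES strings, enabling long-context generative modeling, rich representations, and in-context learning through sequence modeling with runtime linear in the sequence length.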

ABSTRACT

Language models for biological and chemical sequences enable crucial applications such as drug discovery, protein engineering, and precision medicine. Currently, these language models are predominantly based on Transformer architectures. While Transformers have yielded impressive results, their quadratic runtime dependency on the sequence length complicates their use for long genomic sequences and in-context learning on proteins and chemical sequences. Recently, the recurrent xLSTM architecture has been shown to perform favorably compared to Transformers and modern state-space model (SSM) architectures in the natural language domain. Similar to SSMs, xLSTMs have a linear runtime dependency on the sequence length and allow for constant-memory decoding at inference time, which makes them prime candidates for modeling long-range dependencies in biological and chemical sequences. In this work, we tailor xLSTM towards these domains and propose a suite of architectural variants called Bio-xLSTM. Extensive experiments in three large domains, genomics, proteins, and chemistry, were performed to assess xLSTM's ability to model biological and chemical sequences. The results show that models based on Bio-xLSTM a) can serve as proficient generative models for DNA, protein, and chemical sequences, b) learn rich representations for those modalities, and c) can perform in-context learning for proteins and small molecules.
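The constant-memory decoding property described in the abstract can be illustrated with a toy recurrence. The following is a minimal sketch, assuming a simplified mLSTM-style update with fixed scalar gates; it is not the authors' implementation. The names mlstm_step and decode are hypothetical, and the sketch omits the input-dependent exponential gating, numerical stabilization, key scaling, and output gate of the full xLSTM formulation. The point is only that the state (a d x d matrix and a length-d vector) keeps the same size no matter how many tokens have been processed.

```python
def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One simplified mLSTM-style update (hypothetical helper).

    C is a d x d matrix memory (list of lists), n a length-d normalizer.
    Gates are plain scalars here; in xLSTM they depend on the input.
    """
    d = len(n)
    # Matrix memory: C_t = f * C_{t-1} + i * v k^T
    C = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
         for r in range(d)]
    # Normalizer: n_t = f * n_{t-1} + i * k
    n = [f_gate * n[c] + i_gate * k[c] for c in range(d)]
    # Readout: h_t = C_t q / max(|n_t . q|, 1)
    denom = max(abs(sum(n[c] * q[c] for c in range(d))), 1.0)
    h = [sum(C[r][c] * q[c] for c in range(d)) / denom for r in range(d)]
    return C, n, h


def decode(steps, d, i_gate=1.0, f_gate=0.9):
    """Run the recurrence over (q, k, v) triples, carrying a fixed-size state."""
    C = [[0.0] * d for _ in range(d)]
    n = [0.0] * d
    outputs = []
    for q, k, v in steps:
        C, n, h = mlstm_step(C, n, q, k, v, i_gate, f_gate)
        outputs.append(h)
    return outputs, C, n
```

Each step costs O(d^2) regardless of position, so total runtime grows linearly with sequence length, in contrast to attention, where each new token attends to all previous ones.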

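Motivation and objectives

Motivate and develop long-context language models for biological and chemical sequences that go beyond Transformer-based architectures.
Adapt xLSTM into domain-specific variants (DNA-xLSTM, Prot-xLSTM, Chem-xLSTM) that support generation, inpainting, and in-context learning (ICL).
Evaluate Bio-xLSTM on genomics, protein, and chemical sequence tasks and compare it with state-of-the-art baselines.
Demonstrate reverse-complement (RC) equivariance for DNA and evaluate downstream performance on classification and design tasks.
Show in-context learning without fine-tuning, as well as domain-conditioned generation.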

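Proposed method

Extend xLSTM with sLSTM and mLSTM blocks adapted to biological and chemical sequences.
Develop three domain-specific variants, DNA-xLSTM, Prot-xLSTM, and Chem-xLSTM, using context windows and rotary position embeddings (RoPE) to capture long-range dependencies.
Implement four modeling modes: causal language modeling (CLM), masked language modeling (MLM), fill-in-the-middle (FIM), and in-context learning (ICL).
Incorporate reverse-complement (RC) equivariance through post-hoc conjoining (PH) or parameter sharing (PS).
Train DNA-xLSTM on the human genome, including RC-equivariant variants, and compare it with HyenaDNA, Mamba, DNA-Mamba, and Transformers.
Train Prot-xLSTM with FIM on homology-aware, alignment-free inputs built from unaligned homologs, and evaluate generation and variant fitness prediction.
Train Chem-xLSTM for unconditional SMILES generation and domain-conditioned ICL, and evaluate the plausibility of the generated SMILES.
Evaluate long-context capability with RoPE at context sizes of up to 262k tokens for proteins and up to 32k tokens for DNA.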
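The post-hoc conjoining (PH) idea listed above can be sketched for the simplest case, a scalar sequence-level score. This is an illustrative sketch under stated assumptions, not the paper's code: toy_score is a hypothetical stand-in for a trained model, and post_hoc_conjoin simply averages the score of a sequence and of its reverse complement, which makes the combined prediction identical for both strands.

```python
# Complement table for the four DNA bases; other symbols (e.g. N) pass through.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def post_hoc_conjoin(score_fn, seq):
    """Average a model's score over both strands (hypothetical helper)."""
    return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))


def toy_score(seq):
    """Hypothetical stand-in for a model; deliberately strand-dependent."""
    return sum(i + 1 for i, base in enumerate(seq) if base == "A") / len(seq)
```

With this wrapper, toy_score alone gives different values for a sequence and its reverse complement, while post_hoc_conjoin(toy_score, ...) gives the same value for both. Parameter sharing (PS) instead builds the symmetry into the network weights, which this sketch does not attempt to show.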

Figure 1: Overview of Bio-xLSTM. Top left: xLSTM for natural language processing tasks. Top right: Considered modeling approaches for biological sequences: masked language modeling, equivariance to reverse complementary sequence, and in-context learning. Bottom left: DNA-xLSTM models are trained on

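Experimental results

Research questions

RQ1: Can Bio-xLSTM variants effectively model long biological and chemical sequences with linear memory scaling?
RQ2: Do DNA-xLSTM, Prot-xLSTM, and Chem-xLSTM match or outperform domain-specific Transformer and SSM-based models on their respective tasks?
RQ3: Are RC-equivariant designs (PH/PS) beneficial for DNA modeling and downstream tasks?
RQ4: Can Prot-xLSTM perform homolog-conditioned in-context learning for generative design and residue-level prediction?
RQ5: Does Chem-xLSTM enable domain-conditioned in-context learning for molecule generation without fine-tuning?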

Key findings

Task	Metric	HyenaDNA	Mamba-PS	Mamba-PH	xLSTM-PS	xLSTM-PH
DNA downstream genomics	MCC	0.779±0.037	0.799±0.029	0.815±0.048	0.796±0.014	0.824±0.010
DNA downstream genomics	MCC	0.612±0.065	0.541±0.212	0.631±0.026	0.570±0.008	0.598±0.017
DNA downstream genomics	MCC	0.613±0.041	0.609±0.109	0.625±0.129	0.588±0.019	0.625±0.010
DNA downstream genomics	MCC	0.512±0.024	0.488±0.102	0.523±0.039	0.490±0.012	0.526±0.001
DNA downstream genomics	MCC	0.455±0.095	0.388±0.101	0.487±0.170	0.489±0.024	0.504±0.012
DNA downstream genomics	MCC	0.549±0.056	0.440±0.202	0.544±0.045	0.520±0.019	0.537±0.012
DNA downstream genomics	MCC	0.581±0.061	0.604±0.048	0.622±0.030	0.622±0.013	0.627±0.008
DNA downstream genomics	MCC	0.763±0.044	0.789±0.020	0.811±0.022	0.793±0.011	0.813±0.008
DNA downstream genomics	MCC	0.564±0.038	0.525±0.240	0.621±0.054	0.558±0.018	0.583±0.014
DNA downstream genomics	MCC	0.517±0.117	0.491±0.066	0.546±0.073	0.375±0.030	0.545±0.024
DNA downstream genomics	MCC	0.386±0.185	0.416±0.095	0.439±0.054	0.444±0.046	0.466±0.011
DNA downstream genomics	F1 (Promoter: All)	0.960±0.005	0.967±0.004	0.970±0.004	0.962±0.002	0.967±0.001
DNA downstream genomics	F1 (NonTATA)	0.959±0.011	0.968±0.006	0.968±0.010	0.963±0.002	0.970±0.001
DNA downstream genomics	F1 (TATA)	0.944±0.040	0.957±0.015	0.953±0.016	0.948±0.006	0.952±0.005
DNA downstream genomics	Accuracy (Splice Site All)	0.956±0.011	0.927±0.021	0.940±0.027	0.965±0.006	0.974±0.004
DNA downstream genomics	Accuracy (Donor)	0.949±0.024	0.874±0.289	0.948±0.025	0.962±0.004	0.951±0.005
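Most rows of the table report the Matthews correlation coefficient (MCC). For readers unfamiliar with the metric, here is a minimal sketch of the standard binary definition; the helper names confusion_counts and mcc are chosen for illustration, and returning 0.0 when the denominator vanishes is a common convention rather than something stated in the source. How the paper aggregates runs into the reported mean and spread is not shown here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), and unlike accuracy it stays informative when the classes are imbalanced.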

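In CLM and MLM pre-training on the human genome at the 2M-parameter scale, DNA-xLSTM outperforms Transformers, Mamba, and HyenaDNA.
DNA-xLSTM-2M (PH/PS) matches or exceeds the baselines on 12 of 18 downstream genomic classification tasks while using fewer than 2M parameters.
Prot-xLSTM-102M achieves better perplexity and generation quality in homolog-conditioned protein generation, outperforming the ProtMamba and Transformer++ baselines, particularly at long context lengths.
Prot-xLSTM-102M outperforms ProtMamba-107M despite being trained on fewer total tokens, indicating efficient long-context learning.
Chem-xLSTM achieves the lowest Fréchet ChemNet Distance (FCD) in unconditional SMILES generation together with competitive perplexity, indicating realistic chemical outputs.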
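Several of the findings above are stated in terms of perplexity. As a reminder of what that number means, the sketch below computes it from the probabilities a model assigned to the observed tokens, using the standard definition (exponential of the mean negative log-likelihood in nats). The function name and the input format are assumptions made for illustration; the paper's evaluation pipeline is not reproduced here.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood per token, in nats.
    mean_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)
```

A model that guesses uniformly among the four nucleotides assigns probability 0.25 to every token and therefore has a perplexity of about 4; lower values mean the model concentrates more probability on the tokens that actually occur.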

Figure 2: Pre-training of 2M-parameter DNA models on the human reference genome (GRCh38). Models are trained at single-nucleotide resolution with a context length of 1024 bases. Left: causal language modeling . Learning curves display NTP loss ( $\downarrow$ ) on a test set, plotted against the numb


This review was generated by AI and checked by a human editor.