QUICK REVIEW

[Paper Review] InstructProtein: Aligning Human and Protein Language via Knowledge Instruction

Zeyuan Wang, Qiang Zhang|arXiv (Cornell University)|Oct 5, 2023

Topic Modeling | Citations: 8

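One-Line Summary

InstructProtein is an LLM that achieves bidirectional generation between human language and protein language by pre-training on both protein and natural language corpora and then fine-tuning on a knowledge graph-based instruction dataset.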

ABSTRACT

Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins. To address this challenge, we propose InstructProtein, an innovative LLM that possesses bidirectional generation capabilities in both human and protein languages: (i) taking a protein sequence as input to predict its textual function description and (ii) using natural language to prompt protein sequence generation. To achieve this, we first pre-train an LLM on both protein and natural language corpora, enabling it to comprehend individual languages. Then supervised instruction tuning is employed to facilitate the alignment of these two distinct languages. Herein, we introduce a knowledge graph-based instruction generation framework to construct a high-quality instruction dataset, addressing annotation imbalance and instruction deficits in existing protein-text corpus. In particular, the instructions inherit the structural relations between proteins and function annotations in knowledge graphs, which empowers our model to engage in the causal modeling of protein functions, akin to the chain-of-thought processes in natural languages. Extensive experiments on bidirectional protein-text generation tasks show that InstructProtein outperforms state-of-the-art LLMs by large margins. Moreover, InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design, effectively bridging the gap between protein and human language understanding.

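Motivation and Objectives

Bridge the gap between human language and protein language with a single LLM.
Enable bidirectional generation: predict protein function from a sequence, and generate a sequence from a natural language prompt.
Use a knowledge graph to address annotation imbalance and the lack of instruction signals in protein-text corpora.
Provide a high-quality protein instruction dataset for supervised instruction tuning.
Demonstrate improvements on zero-shot protein understanding and sequence design tasks.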

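Proposed Method

Multilingual pre-training on protein sequence and natural language corpora.
Construct a protein knowledge graph from UniProtKB and encode causal relations among annotations using Knowledge Causal Modeling (KCM).
Apply debiased KG triple sampling, which balances annotations based on sequence and property similarity.
Convert the sampled KG triples into instruction data using KG completion tasks and LLMs (e.g., ChatGPT).
Fine-tune the pre-trained model on the generated protein knowledge instructions to align human language and protein language.
Evaluate on bidirectional tasks: protein sequence understanding and protein design in the zero-shot setting.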
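
The sampling and conversion steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Triple` fields, the precomputed `cluster` id, the function names, and the fixed prompt templates are all assumptions made for illustration. In particular, the review says the conversion uses KG completion tasks and LLMs such as ChatGPT, whereas this sketch substitutes simple string templates. Picking a cluster uniformly before picking a triple is one simple way to keep heavily annotated protein families from dominating the sample.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One KG triple (hypothetical schema, not taken from the paper)."""
    protein: str     # protein sequence (head entity)
    relation: str    # annotation type, e.g. "function"
    annotation: str  # annotation text (tail entity)
    cluster: int     # assumed precomputed cluster id (sequence/property similarity)


def sample_debiased(triples, n, rng):
    """Draw n triples: pick a cluster uniformly, then a triple uniformly inside it."""
    by_cluster = defaultdict(list)
    for t in triples:
        by_cluster[t.cluster].append(t)
    cluster_ids = sorted(by_cluster)
    return [rng.choice(by_cluster[rng.choice(cluster_ids)]) for _ in range(n)]


def to_instruction(t, direction):
    """Turn a triple into an instruction/output pair using fixed templates.

    The templates are a stand-in for the LLM-based rewriting described above.
    """
    if direction == "understand":  # sequence -> text
        return {
            "instruction": f"Describe the {t.relation} of this protein: {t.protein}",
            "output": t.annotation,
        }
    if direction == "design":  # text -> sequence
        return {
            "instruction": f"Design a protein whose {t.relation} is: {t.annotation}",
            "output": t.protein,
        }
    raise ValueError(f"unknown direction: {direction}")


# Toy data: cluster 0 is over-represented 9:1 relative to cluster 1.
toy = [Triple("MKT", "function", "ATP binding", 0)] * 9
toy.append(Triple("GSH", "function", "zinc ion binding", 1))

batch = sample_debiased(toy, 4, random.Random(0))
examples = [to_instruction(t, "understand") for t in batch]
```

With naive uniform sampling over triples, the rare cluster would appear in roughly one draw out of ten; with the cluster-first scheme it appears in roughly half of them, which is the balancing effect the debiased sampling step is meant to provide.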

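Experimental Results

Research Questions

RQ1: Can a single LLM trained on both protein sequences and natural language generate bidirectionally between human language and protein language?
RQ2: Does KG-based instruction generation with debiased sampling improve zero-shot protein understanding and design compared with baseline LLMs?
RQ3: How well can the instruction-tuned model generate protein sequences that satisfy function- or structure-related prompts?
RQ4: Does incorporating Knowledge Causal Modeling (KCM) strengthen the instruction signal and improve model performance?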

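Key Findings

InstructProtein achieves state-of-the-art zero-shot performance on protein sequence understanding compared with multiple baselines.
The KG-based instruction generation framework, with debiased sampling and KCM, improves instruction quality and the model's alignment between protein and human language.
The model demonstrates bidirectional generation, enabling protein design guided by natural language instructions and function-oriented prompts.
Ablation studies show that clustering by protein properties and the inclusion of KCM both contribute to the performance gains.
De novo protein sequence design experiments suggest that the model can generate sequences with structure-related properties and potential functional annotations.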

This review was generated by AI and checked by a human editor.