QUICK REVIEW

[Paper Review] OpenProteinSet: Training data for structural biology at scale

Gustaf Ahdritz, Nazim Bouatta|arXiv (Cornell University)|Aug 10, 2023

Machine Learning in Bioinformatics|Cited by 15

One-sentence summary

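OpenProteinSet is a large-scale open-source corpus of more than 16 million MSAs, together with associated structural homologs and AlphaFold2 structure predictions, designed for training protein machine learning models at AlphaFold2 scale and beyond. It also includes a filtered, diverse subset of about 270,000 MSAs with corresponding structure predictions.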

ABSTRACT

Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.

Motivation and objectives

Motivate open access to large-scale MSA data to advance protein structure prediction and related tasks.
Provide a diverse, deep, and reusable MSA corpus comparable in scale to AlphaFold2 training data.
Offer associated structural templates and AlphaFold2-predicted structures for training with high-quality MSAs.
Enable evaluation and validation frameworks for open-source protein modeling with OpenFold and similar models.

Proposed method

Assembled MSAs for all unique PDB chains (140k) and for Uniclust30 clusters (16M MSAs).
Computed three MSAs per chain using different tools and databases (JackHMMer with MGnify and UniRef90; HHblits with BFD and Uniclust30).
Generated a filtered, diverse, and deep subset of 270,262 MSAs by removing redundancies and applying length cutoffs (200–1024 residues).
Identified template hits via HHSearch against PDB70 and produced OpenFold-based structure predictions for representative chains.
Provided associated templates in HHSearch format and structures in PDB format; all data released under CC BY 4.0.
Demonstrated utility by training OpenFold (an open-source replication of AlphaFold2) on the corpus and comparing its performance with the original AlphaFold2.
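
The length-cutoff step in the list above can be illustrated with a minimal sketch. It assumes each MSA is available as FASTA-like A3M text whose first record is the query sequence, and that the 200 and 1024 residue bounds are inclusive; both points are assumptions, and `parse_a3m`, `query_length`, and `keep_msa` are hypothetical helper names, not part of the OpenProteinSet tooling. The redundancy-removal step is not shown.

```python
from typing import List, Tuple

# Length cutoffs quoted in the list above; treating them as inclusive is an assumption.
MIN_LEN, MAX_LEN = 200, 1024


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-like A3M text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def query_length(records: List[Tuple[str, str]]) -> int:
    """Number of residues in the first (query) sequence, ignoring gap characters."""
    if not records:
        return 0
    return sum(1 for c in records[0][1] if c != "-")


def keep_msa(text: str) -> bool:
    """Return True if the query length falls within the assumed cutoffs."""
    return MIN_LEN <= query_length(parse_a3m(text)) <= MAX_LEN
```

Calling `keep_msa` on the contents of one alignment file returns whether that MSA would pass this illustrative length filter; a real pipeline would combine it with the depth and diversity criteria described in the paper.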

Experimental results

Research questions

RQ1 How large-scale open MSA datasets can be constructed to match the scale of proprietary training sets like AlphaFold2.
RQ2 What subset of MSAs balances depth and diversity for effective AlphaFold2-style training.
RQ3 How high-quality MSA-derived templates and structure predictions from OpenProteinSet influence training outcomes for protein structure prediction models.

Key findings

Protein origin	Count (approx.)	MSA	Template hits	Structure
PDB (all unique chains)	140,000	✓	✓	Experimentally determined
Uniclust30 (filtered)	270,000	✓	✓	Predicted by AlphaFold2
Uniclust30 (unfiltered)	16 million	✓	×	×

OpenProteinSet includes over 16 million Uniclust30 MSAs plus PDB-chain MSAs and AlphaFold2-like structure predictions.
From Uniclust30, a diverse, deep subset of 270,262 MSAs was selected with associated template hits and structure predictions.
OpenFold trained on OpenProteinSet achieved near-parity with AlphaFold2 on CASP15 domains (mean GDT-TS: 73.8 vs 74.6; OpenFold was at least as good on 50% of targets).
On a 180-protein validation set (CAMEO), the final OpenFold model achieved an lDDT-Cα of about 0.907, with low run-to-run variability across seeds.
The OpenProteinSet MSAs represent millions of compute-hours and demonstrate effective replication of AlphaFold2-scale training in an open framework.
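
The CASP15 comparison above reports two kinds of summary: a mean score per model and the share of targets on which one model is at least as good as the other. A minimal sketch of that arithmetic follows; `compare_scores` is a hypothetical helper, and the scores in the usage note are placeholder values chosen for illustration, not results from the paper.

```python
from statistics import mean
from typing import Sequence, Tuple


def compare_scores(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean of a, mean of b, fraction of targets where a >= b).

    `a` and `b` hold per-target scores (for example GDT-TS) in the same target order.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    at_least_as_good = sum(1 for x, y in zip(a, b) if x >= y)
    return mean(a), mean(b), at_least_as_good / len(a)
```

For example, with the placeholder lists `[80.0, 60.0, 70.0, 90.0]` and `[82.0, 60.0, 65.0, 95.0]`, the function returns means of 75.0 and 75.5 and a fraction of 0.5, showing how a slightly lower mean can coexist with parity on half of the targets.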

This review was generated by AI and checked by a human editor.