QUICK REVIEW

[Paper Review] Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Muhammad Muneeb, David B. Ascher|arXiv (Cornell University)|Mar 6, 2026

Genetic Associations and Epidemiology|Citations: 0

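One-line summary

This study evaluates 80 binary phenotypes from openSNP with 29 ML algorithms, 80 DL algorithms, and 3 PRS tools (the latter under 675 clumping/pruning configurations), reports the average 5-fold AUC, and compares the performance of the ML/DL and PRS approaches.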

ABSTRACT

Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association studies summary statistic file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk scores tools with 675 different clumping and pruning parameters. Machine learning outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk scores tools.

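Motivation and objectives

Benchmark genotype-phenotype prediction on binary phenotypes derived from openSNP.
Compare machine learning, deep learning, and polygenic risk score tools for case-control classification.
Systematically vary SNP pre-selection and PRS parameters and assess their effect on performance.
Highlight phenotypes for which environmental factors constrain genetic prediction, and discuss transfer learning as a consideration.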

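Proposed method

openSNP phenotype data is preprocessed to extract 80 binary phenotypes, and the data is converted to Plink format.
Quality control is applied with thresholds (MAF 0.01, HWE 1e-6, genotype rate 0.01, missingness rate 0.7), and duplicates are removed.
For ML/DL: SNPs are selected using GWAS p-value thresholds (yielding 50 to 10,000 SNPs), 29 ML and 80 DL models are trained with a range of hyperparameters, and the average 5-fold AUC is reported.
For PRS: a GWAS-based base file is generated from the training data, quality control is applied, and pruning and clumping are run across 675 parameter combinations with Plink, PRSice, and Lassosum; the PRS are converted to binary labels for evaluation.
Performance is compared across ML/DL and PRS tools using AUC, and the best model for each phenotype is collated.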
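The evaluation loop described above (select SNPs by p-value on the training fold, score the held-out fold, average AUC over 5 folds) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: a simple allelic chi-square test stands in for the GWAS step, the PRS is an unclumped weighted sum of dosages, and the names `allelic_test`, `auc`, and `prs_cv_auc` are hypothetical.

```python
import math
import random


def allelic_test(dosages, labels):
    """Return (log odds ratio, p-value) for one SNP from a 2x2 allele-count table."""
    a = sum(g for g, y in zip(dosages, labels) if y == 1)      # effect alleles in cases
    b = sum(2 - g for g, y in zip(dosages, labels) if y == 1)  # other alleles in cases
    c = sum(g for g, y in zip(dosages, labels) if y == 0)      # effect alleles in controls
    d = sum(2 - g for g, y in zip(dosages, labels) if y == 0)  # other alleles in controls
    # Adding 0.5 to every cell keeps the odds ratio finite when a cell is zero.
    beta = math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Survival function of a chi-square variable with 1 degree of freedom.
    return beta, math.erfc(math.sqrt(chi2 / 2))


def auc(scores, labels):
    """Rank-based AUC: chance that a case outscores a control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prs_cv_auc(genotypes, labels, p_threshold=0.05, k=5, seed=0):
    """Mean k-fold AUC of a p-value-thresholded PRS.

    genotypes: one list of 0/1/2 dosages per sample; labels: 0 (control) or 1 (case).
    """
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    cases = [i for i in idx if labels[i] == 1]
    controls = [i for i in idx if labels[i] == 0]
    # Stratified folds so every test fold contains both cases and controls.
    folds = [cases[f::k] + controls[f::k] for f in range(k)]
    fold_aucs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        y_train = [labels[i] for i in train]
        # Association statistics come from the training fold only.
        weights = {}
        for j in range(len(genotypes[0])):
            beta, p = allelic_test([genotypes[i][j] for i in train], y_train)
            if p <= p_threshold:
                weights[j] = beta
        # PRS = sum of effect size times dosage over the selected SNPs.
        scores = [sum(w * genotypes[i][j] for j, w in weights.items()) for i in test]
        fold_aucs.append(auc(scores, [labels[i] for i in test]))
    return sum(fold_aucs) / k


if __name__ == "__main__":
    # Synthetic toy data (not from openSNP): SNP 0 tracks the label, the rest are noise.
    rng = random.Random(1)
    y = [i % 2 for i in range(100)]
    X = [[2 * label] + [rng.randint(0, 2) for _ in range(4)] for label in y]
    print(prs_cv_auc(X, y))
```

Computing the association statistics inside each fold, rather than once on the full dataset, keeps the held-out samples from influencing which SNPs are selected.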

Figure 1: A workflow of genotype-phenotype prediction using ML/DL and PRS. A case/control classification flowchart using ML/DL and PRS tools. First, clean phenotype data and extract binary phenotypes from the openSNP dataset. Second, merge the genotype data for each phenotype, convert the dataset to

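Experimental results

Research questions

RQ1: Which phenotypes are easier to predict with PRS tools than with ML/DL models?
RQ2: How do SNP pre-selection thresholds and clumping/pruning parameters affect predictive performance across methods?
RQ3: Which specific models and hyperparameters yield the highest AUC for individual phenotypes?
RQ4: Are there phenotypes for which environmental factors limit genetic prediction, and can transfer learning mitigate data scarcity?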

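Key findings

ML/DL outperformed PRS tools on 44 phenotypes, while PRS tools performed better on 36 phenotypes.
ANN was the best DL algorithm for 26 phenotypes, and XGBoost was the best ML algorithm for 11 phenotypes.
Best DL hyperparameters: dropout 0.2, Adam optimizer, batch size 1, 50 epochs (best for 23 phenotypes).
Among the PRS tools, Plink was often the best, and a clumping threshold of 0.1 frequently gave better results; Lassosum benefited from varying the pruning parameters.
For some phenotypes (e.g., Type II diabetes, Seborrhoeic Dermatitis, Aphantasia, Eczema, Hypertension, Plantar Fasciitis, Fibromyalgia), ML/DL achieved AUC ≥ 80%; for Scoliosis, Restless leg syndrome, Misophonia, Hypertriglyceridemia, and Bone Mineral Density, PRS achieved AUC ≥ 80%.
Overall conclusion: the best model depends on data quality, phenotype structure, and hyperparameters; the openSNP data limits interpretation; transfer learning is discussed as a strategy.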
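To make the clumping step behind these findings concrete, here is a minimal greedy sketch in Python. It is a simplification under stated assumptions, not Plink's implementation: all SNPs are taken to lie on one chromosome, linkage disequilibrium is approximated by the squared Pearson correlation of dosages, and the function names, dictionary keys, and default values are hypothetical. The review does not say whether the threshold of 0.1 refers to the r² cutoff or to a p-value cutoff, so both are exposed as parameters.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two dosage vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def clump(snps, p1=0.05, r2=0.1, window_kb=250):
    """Greedy clumping; returns the ids of the retained index SNPs.

    snps: list of dicts with keys "id", "pos" (base pairs), "p", "dosages".
    p1: only SNPs with p <= p1 can be kept.
    r2, window_kb: a SNP is absorbed into a clump when it lies within the
    window of the index SNP and its r^2 with the index SNP is at least r2.
    """
    remaining = sorted((s for s in snps if s["p"] <= p1), key=lambda s: s["p"])
    kept = []
    while remaining:
        index = remaining.pop(0)  # most significant SNP still unassigned
        kept.append(index["id"])
        remaining = [
            s for s in remaining
            if abs(s["pos"] - index["pos"]) > window_kb * 1000
            or r_squared(s["dosages"], index["dosages"]) < r2
        ]
    return kept


if __name__ == "__main__":
    # Synthetic toy SNPs for illustration only.
    toy = [
        {"id": "A", "pos": 1000, "p": 1e-8, "dosages": [0, 1, 2, 0, 1, 2]},
        {"id": "B", "pos": 2000, "p": 1e-4, "dosages": [0, 1, 2, 0, 1, 2]},
        {"id": "C", "pos": 3000, "p": 1e-3, "dosages": [1, 0, 1, 1, 2, 1]},
    ]
    print(clump(toy))
```

A lower r² cutoff removes more correlated SNPs around each index SNP, so fewer and more independent variants enter the score, which is one plausible reason the choice of threshold changes the resulting AUC.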

Figure 2: This diagram shows the AUC for each phenotype obtained from the ML/DL algorithms, with phenotypes grouped by the number of SNPs that yields the best results.

This review was generated by AI and checked by a human editor.