QUICK REVIEW

[Paper Review] Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Muhammad Muneeb, David B. Ascher|arXiv (Cornell University)|2026. 03. 06.

Genetic Associations and Epidemiology | Citations: 0

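One-Line Summary

This study benchmarks 80 binary phenotypes from openSNP using 29 ML algorithms, 80 DL algorithms, and 3 PRS tools (the PRS tools across 675 clumping/pruning configurations), and reports the average 5-fold AUC to compare the performance of ML/DL and PRS approaches.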

ABSTRACT

Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association studies summary statistic file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk scores tools with 675 different clumping and pruning parameters. Machine learning outperformed for 44 phenotypes, while polygenic risk score tools excelled for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk scores tools.

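Motivation and Objectives

Benchmark genotype-phenotype prediction on binary phenotypes derived from openSNP.
Compare machine learning, deep learning, and polygenic risk score tools for case-control classification.
Systematically vary SNP preselection and PRS parameters to assess their effect on performance.
Highlight phenotypes where environmental factors limit genetic prediction, and discuss transfer learning considerations.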

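Proposed Method

Preprocess the openSNP phenotype data to extract 80 binary phenotypes and convert the data to Plink format.
Perform quality control and deduplication on the genotype data with thresholds of MAF 0.01, HWE 1e-6, genotype rate 0.01, and missingness rate 0.7.
For ML/DL, select SNPs using GWAS p-value thresholds (50–10000 SNPs), train 29 ML and 80 DL models with varied hyperparameters, and report the average 5-fold AUC.
For PRS, generate a GWAS-based base file from the training data, apply quality control, and run the Plink, PRSice, and Lassosum tools with pruning/clumping across 675 parameter combinations; convert the PRS to binary for evaluation.
Compare AUC between the ML/DL and PRS tools and summarize the best-performing model for each phenotype.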
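The quality-control step above can be sketched as a PLINK 1.9 invocation. This is a minimal sketch, not the authors' command: the review does not quote one, so the mapping of "genotype rate 0.01" to `--geno` and "missingness rate 0.7" to `--mind` is an assumption, and the file prefixes are hypothetical.

```python
from typing import List


def plink_qc_args(bfile: str, out: str) -> List[str]:
    """Build a PLINK 1.9 argument list for the QC thresholds quoted in the review."""
    return [
        "plink",
        "--bfile", bfile,   # hypothetical input prefix (.bed/.bim/.fam)
        "--maf", "0.01",    # minor allele frequency threshold
        "--hwe", "1e-6",    # Hardy-Weinberg equilibrium p-value threshold
        "--geno", "0.01",   # ASSUMED flag for "genotype rate 0.01"
        "--mind", "0.7",    # ASSUMED flag for "missingness rate 0.7"
        "--make-bed",       # write the filtered data as a new binary fileset
        "--out", out,       # hypothetical output prefix
    ]


args = plink_qc_args("phenotype_01", "phenotype_01.qc")
print(" ".join(args))
```

The list form can be handed to `subprocess.run(args, check=True)` on a machine where PLINK is installed; building the list separately keeps the thresholds easy to inspect and vary.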

Figure 1: A workflow of genotype-phenotype prediction using ML/DL and PRS. A case/control classification flowchart using ML/DL and PRS tools. First, clean phenotype data and extract binary phenotypes from the openSNP dataset. Second, merge the genotype data for each phenotype, convert the dataset to
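The scoring and evaluation stage of this workflow can be illustrated with a small, dependency-free sketch: keep SNPs below a p-value threshold, form a weighted-sum polygenic score, and compute AUC from case/control pairs. The SNP identifiers, effect sizes, p-values, and genotypes below are toy values invented for illustration, and the function names are hypothetical. In the study the summary statistics come from a GWAS on the training folds, and the AUC is averaged over 5 folds.

```python
from typing import Dict, List, Sequence


def select_snps(pvalues: Dict[str, float], threshold: float, max_snps: int) -> List[str]:
    """Keep SNPs with p < threshold, most significant first, capped at max_snps."""
    kept = sorted((p, snp) for snp, p in pvalues.items() if p < threshold)
    return [snp for _, snp in kept[:max_snps]]


def polygenic_score(dosages: Dict[str, int], betas: Dict[str, float],
                    snps: Sequence[str]) -> float:
    """Weighted sum of allele dosages (0, 1, or 2) over the selected SNPs."""
    return sum(betas[s] * dosages.get(s, 0) for s in snps)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy summary statistics (invented for illustration only).
pvalues = {"rs1": 1e-8, "rs2": 0.03, "rs3": 0.4, "rs4": 1e-5}
betas = {"rs1": 0.5, "rs2": -0.2, "rs3": 0.1, "rs4": 0.3}

selected = select_snps(pvalues, threshold=0.05, max_snps=2)
print(selected)  # ['rs1', 'rs4']

# Four toy individuals: two cases (label 1) and two controls (label 0).
genotypes = [
    {"rs1": 2, "rs4": 1},
    {"rs1": 1, "rs4": 0},
    {"rs1": 0, "rs4": 1},
    {"rs1": 1, "rs4": 1},
]
labels = [1, 1, 0, 0]

scores = [polygenic_score(g, betas, selected) for g in genotypes]
print(auc(scores, labels))  # 0.75
```

The pairwise AUC is quadratic in sample size, which is fine for a sketch. A rank-based implementation would be the usual choice for larger cohorts.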

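Experimental Results

Research Questions

RQ1: Which phenotypes are predicted better by ML/DL models than by PRS tools?
RQ2: How do SNP preselection thresholds and clumping/pruning parameters affect predictive performance across methods?
RQ3: Which specific models and hyperparameters give the highest AUC for individual phenotypes?
RQ4: Are there phenotypes for which environmental factors limit genetic prediction, and does transfer learning help with data scarcity?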

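Key Results

ML/DL outperformed the PRS tools on 44 phenotypes, while the PRS tools performed better on 36 phenotypes.
ANN was the best DL algorithm for 26 phenotypes; XGBoost was the best ML algorithm for 11 phenotypes.
Best DL hyperparameters: dropout 0.2, Adam optimizer, batch size 1, 50 epochs (23 phenotypes).
Among the PRS tools, Plink often performed best, and a pruning threshold of 0.1 gave better results; Lassosum benefited from varying the pruning parameters.
For some phenotypes (e.g., Type II diabetes, seborrheic dermatitis, aphantasia, eczema, hypertension, plantar fasciitis, fibromyalgia), ML/DL achieved AUC ≥ 80%; PRS achieved AUC ≥ 80% for scoliosis, restless legs syndrome, misophonia, hypertriglyceridemia, and bone density.
Overall conclusion: the best model depends on data quality, phenotype structure, and hyperparameters; the limitations of the openSNP data constrain interpretation, and transfer learning is discussed as a strategy for mitigating data scarcity.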
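The per-phenotype comparison behind the 44 versus 36 split can be sketched as a simple tally of which family has the higher best AUC. The phenotype names and AUC values below are placeholders invented for illustration, not results from the paper, and `count_wins` is a hypothetical helper.

```python
from typing import Dict


def count_wins(ml_auc: Dict[str, float], prs_auc: Dict[str, float]) -> Dict[str, int]:
    """Count phenotypes where the best ML/DL AUC or the best PRS AUC is higher."""
    wins = {"ML/DL": 0, "PRS": 0, "tie": 0}
    for phenotype, ml_value in ml_auc.items():
        prs_value = prs_auc[phenotype]
        if ml_value > prs_value:
            wins["ML/DL"] += 1
        elif prs_value > ml_value:
            wins["PRS"] += 1
        else:
            wins["tie"] += 1
    return wins


# Placeholder values only; each entry stands for the best AUC of that family.
best_ml = {"phenotype_A": 0.82, "phenotype_B": 0.61, "phenotype_C": 0.70}
best_prs = {"phenotype_A": 0.74, "phenotype_B": 0.66, "phenotype_C": 0.70}

tally = count_wins(best_ml, best_prs)
print(tally)  # {'ML/DL': 1, 'PRS': 1, 'tie': 1}
```

Tracking ties separately makes it explicit how equal AUCs are handled. The review reports only the two-way split, which sums to all 80 phenotypes.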

Figure 2: This diagram shows the AUC for each phenotype obtained from the ML/DL algorithms and group phenotypes on the number of SNPs that yield the best results.


This review was generated by AI and reviewed by a human editor.