QUICK REVIEW

[Paper Review] Outcome-guided Sparse K-means for Disease Subtype Discovery via Integrating Phenotypic Data with High-dimensional Transcriptomic Data

Lingsong Meng, Dorina Avram|arXiv (Cornell University)|Mar 18, 2021

Gene expression and cancer classification|References: 62|Cited by: 8

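One-sentence summary

This paper proposes a novel clustering method, outcome-guided sparse K-means (GuidedSparseKmeans), which identifies biologically meaningful disease subtypes by integrating high-dimensional transcriptomic data with clinical outcome variables. By jointly optimizing sample clustering, lasso-regularized gene selection, and outcome-guided clustering under a unified objective function, the method markedly improves subtype interpretability and performance over existing sparse clustering methods, both in simulations and in real-world applications to breast cancer and Alzheimer's disease.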

ABSTRACT

The discovery of disease subtypes is an essential step for developing precision medicine, and disease subtyping via omics data has become a popular approach. While promising, subtypes obtained from existing approaches are not necessarily associated with clinical outcomes. With the rich clinical data along with the omics data in modern epidemiology cohorts, it is urgent to develop an outcome-guided clustering algorithm to fully integrate the phenotypic data with the high-dimensional omics data. Hence, we extended a sparse K-means method to an outcome-guided sparse K-means (GuidedSparseKmeans) method. A unified objective function was proposed, which comprised (i) weighted K-means to perform sample clustering; (ii) lasso regularization to perform gene selection from the high-dimensional omics data; (iii) incorporation of a phenotypic variable from the clinical dataset to facilitate biologically meaningful clustering results. By iteratively optimizing the objective function, we simultaneously obtain phenotype-related sample clustering results and gene selection results. We demonstrated the superior performance of the GuidedSparseKmeans by comparing with existing clustering methods in simulations and applications of high-dimensional transcriptomic data of breast cancer and Alzheimer's disease. Our algorithm has been implemented in an R package, which is publicly available on GitHub (https://github.com/LingsongMeng/GuidedSparseKmeans).

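Research motivation and objectives

Address the problem that subtypes produced by existing clustering methods can lack biological or clinical relevance.
Integrate high-dimensional transcriptomic data with multiple types of clinical outcome variables (continuous, binary, survival time, and others) to guide clustering.
Perform gene selection and sample clustering simultaneously while ensuring that the identified subtypes are associated with clinically meaningful outcomes.
Build a unified optimization framework that balances the intrinsic gene signal against outcome-guided clustering.
Improve the interpretability and reproducibility of disease subtyping by incorporating domain-specific clinical markers, such as ER status in breast cancer or Braak stage in Alzheimer's disease.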

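Proposed method

Construct a unified objective function that combines weighted K-means clustering, lasso regularization for gene selection, and a clinical-outcome guidance term.
Use an alternating optimization algorithm that iteratively updates cluster assignments, gene weights, and outcome coefficients.
Incorporate multiple types of clinical outcomes (continuous, binary, ordinal, count, survival time, and others) through a flexible link function in the objective function.
Apply a lasso penalty to select a sparse subset of genes most relevant to both the clustering and the clinical outcome.
Estimate the tuning parameters (K, λ, s) using the silhouette statistic, sensitivity analysis, and an extended silhouette statistic, balancing model complexity against outcome association.
Release an R package on GitHub for public use and reproduction of results.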
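
The gene-weight step of the alternating scheme described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' R implementation: it assumes each gene's score is its between-cluster to total sum-of-squares ratio plus λ times an outcome-association measure (here the squared Pearson correlation with a continuous outcome), and it reuses the soft-thresholding weight update from standard sparse K-means. The function names (bcss_ratio, squared_correlation, update_gene_weights, guided_weights) and the exact form of the guidance term are hypothetical choices made for this sketch.

```python
import numpy as np


def bcss_ratio(X, labels):
    """Per-gene between-cluster sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    tss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    wcss = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        wcss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    # Constant genes (TSS == 0) carry no signal and get a ratio of 0.
    return np.divide(tss - wcss, tss, out=np.zeros_like(tss), where=tss > 0)


def squared_correlation(X, y):
    """Assumed guidance term: squared Pearson correlation of each gene with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc.T @ yc) ** 2
    den = (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def l2_normalize(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def update_gene_weights(scores, s, n_iter=50):
    """Maximize sum_j w_j * scores_j subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Uses soft-thresholding with a binary search on the threshold,
    as in standard sparse K-means.
    """
    if s < 1:
        raise ValueError("s must be at least 1 for a unit-norm weight vector")
    a = np.maximum(np.asarray(scores, dtype=float), 0.0)
    w = l2_normalize(a)
    if w.sum() <= s:
        return w  # the L1 constraint is already satisfied
    lo, hi = 0.0, a.max()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if l2_normalize(np.maximum(a - mid, 0.0)).sum() > s:
            lo = mid  # still too dense: raise the threshold
        else:
            hi = mid  # feasible: try a smaller threshold
    return l2_normalize(np.maximum(a - hi, 0.0))


def guided_weights(X, labels, y, lam, s):
    """One weight update with cluster labels held fixed (hypothetical helper)."""
    scores = bcss_ratio(X, labels) + lam * squared_correlation(X, y)
    return update_gene_weights(scores, s)
```

A full fit would alternate this step with a weighted K-means reassignment of the samples until the weights stabilize. Setting lam to 0 in this sketch reduces the scores to the clustering signal alone, which is the unguided sparse K-means case up to the per-gene normalization.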

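Experimental results

Research questions

RQ1: Can outcome-guided clustering improve the biological relevance of disease subtypes identified from high-dimensional transcriptomic data?
RQ2: How does incorporating clinical outcome variables such as HER2 status or Braak stage affect the accuracy and interpretability of subtype discovery?
RQ3: How large is the advantage of the proposed method over standard sparse K-means and other clustering methods in identifying clinically meaningful subtypes?
RQ4: How robust is the method when the number of clusters (K) or the number of selected genes is misspecified?
RQ5: Can the method handle diverse clinical outcome types, including survival time, binary, and continuous variables, within a unified framework?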

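Key findings

In simulations, GuidedSparseKmeans significantly outperformed standard sparse K-means, reaching an adjusted Rand index of up to 0.85 and separating the true subtypes more cleanly.
In the METABRIC breast cancer dataset (n = 1,870 samples, 12,180 genes), the HER2-guided model identified the subtypes with the most significant survival difference (p < 0.001), along with significant enrichment of hormone signaling pathways.
In the Alzheimer's disease RNA-seq dataset (n = 217 samples, 15,363 genes), the Braak-stage-guided model produced clusters with strong biological interpretability that were highly correlated with neurofibrillary tangle progression.
The method is computationally fast: it took 31 seconds on the breast cancer dataset and only 7 seconds on the Alzheimer's disease dataset, indicating good scalability.
Outcome guidance made gene selection more accurate: in simulations, 80–90% of the top 10% highest-scoring genes were truly disease-related, compared with only 50–60% for standard sparse K-means.
On real data, the method was moderately robust to misspecification of K, but its performance declined in simulations with a clear, well-defined cluster structure, indicating sensitivity to cluster structure.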
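
The adjusted Rand index cited in the findings above measures agreement between an estimated clustering and the true subtype labels, corrected for chance: 1 means the two partitions are identical up to relabeling, and values near 0 mean chance-level agreement. A minimal standard-library Python sketch of the usual pair-counting formula follows. The function name adjusted_rand_index is hypothetical, and returning 1.0 when both partitions are trivial is a convention assumed here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Contingency table cells and its row/column margins.
    cell_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, in the first,
    # and in the second partition.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / total_pairs  # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

Because the index depends only on which samples are grouped together, swapping the cluster names in either labeling leaves the value unchanged, which is why it suits comparisons between estimated clusters and true subtypes.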

This review was generated by AI and reviewed by a human editor.