QUICK REVIEW

[Paper Review] Important Feature PCA for high dimensional clustering

Jiashun Jin, Wanjie Wang|arXiv (Cornell University)|Jul 20, 2014

Sparse and Compressive Sensing Techniques | References: 43 | Cited by: 6

One-Sentence Summary

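This paper proposes Important Features PCA (IF-PCA), a tuning-free clustering method for high-dimensional data (p ≫ n) that selects the features with the largest Kolmogorov-Smirnov (KS) scores, using a threshold chosen by adapting Higher Criticism, and then applies k-means to the first (K − 1) left singular vectors of the post-selection normalized data matrix. IF-PCA achieves clustering consistency and performs competitively on gene microarray data; in three of the data sets its error rates are only 29% or less of those of other methods.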

ABSTRACT

We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, ..., n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Important Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, where the threshold is chosen by adapting the recent notion of Higher Criticism, obtain the first (K − 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means to these singular vectors. It can be seen that IF-PCA is a tuning-free clustering method. We apply IF-PCA to 10 gene microarray data sets. The method has competitive performance in clustering. Especially, in three of the data sets, the error rates of IF-PCA are only 29% or less of the error rates by other methods. We have also rediscovered a phenomenon on empirical null by [16] on microarray data. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov-Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.

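Motivation and Objectives

Address clustering of high-dimensional data (p ≫ n), a regime in which classical clustering methods struggle.
Develop a robust, adaptive, and tuning-free clustering method.
Provide theoretical guarantees of clustering consistency under high-dimensional asymptotics.
Reveal the phase transition phenomenon associated with clustering, sparse PCA, and low-rank matrix recovery.
Revisit, on microarray data, the empirical null phenomenon reported in [16].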

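Proposed Method

Select the small fraction of features with the largest Kolmogorov-Smirnov (KS) scores to identify the informative features.
Set the feature-selection threshold by adapting Higher Criticism, so that weak signals can still be picked up.
Normalize the selected features to form the post-selection data matrix.
Compute the first (K − 1) left singular vectors of the post-selection normalized data matrix.
Apply classical k-means to these (K − 1) singular vectors to estimate the class labels.
For the theory, derive tight probability bounds on the KS statistics and carry out a post-selection eigen-analysis, which together establish clustering consistency.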
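
The selection, SVD, and k-means steps listed above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the p-values are calibrated by simulating pure Gaussian features (an assumption made here), the Higher Criticism functional is a standard textbook form that may differ from the variant the paper adapts, no empirical-null adjustment is attempted, and the small Lloyd-style `kmeans` helper and all function names are hypothetical. It assumes p ≥ 2, K ≥ 2, and no constant columns.

```python
import numpy as np
from scipy.stats import norm


def standardize(X):
    """Center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def ks_scores(W):
    """sqrt(n) times the KS distance between each column of W and N(0, 1)."""
    n = W.shape[0]
    cdf = norm.cdf(np.sort(W, axis=0))
    ranks = np.arange(1, n + 1)[:, None]
    d_plus = (ranks / n - cdf).max(axis=0)
    d_minus = (cdf - (ranks - 1) / n).max(axis=0)
    return np.sqrt(n) * np.maximum(d_plus, d_minus)


def monte_carlo_pvalues(scores, n, rng, n_sim=2000):
    """Assumption: calibrate scores against simulated pure-noise features."""
    null = np.sort(ks_scores(standardize(rng.standard_normal((n, n_sim)))))
    n_exceed = n_sim - np.searchsorted(null, scores, side="left")
    return (n_exceed + 1) / (n_sim + 1)


def hc_select(pvals):
    """Keep the j smallest p-values, with j maximizing a standard HC score."""
    p = pvals.size
    order = np.argsort(pvals)
    half = p // 2  # search only the smaller half of the sorted p-values
    frac = np.arange(1, half + 1) / p
    hc = np.sqrt(p) * (frac - pvals[order[:half]]) / np.sqrt(frac * (1 - frac))
    return order[: np.argmax(hc) + 1]


def kmeans(points, K, rng, n_restarts=10, n_iter=50):
    """Plain Lloyd iterations with random restarts; returns the best labels."""
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        centers = points[rng.choice(len(points), size=K, replace=False)]
        for _ in range(n_iter):
            dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            # Recompute each center; keep the old one if its cluster is empty.
            centers = np.array([
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(K)
            ])
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        cost = dist.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = dist.argmin(axis=1), cost
    return best_labels


def if_pca(X, K, seed=0):
    """Select features by KS score and HC, then cluster the top singular vectors."""
    rng = np.random.default_rng(seed)
    W = standardize(X)
    pvals = monte_carlo_pvalues(ks_scores(W), X.shape[0], rng)
    selected = hc_select(pvals)
    U, _, _ = np.linalg.svd(W[:, selected], full_matrices=False)
    labels = kmeans(U[:, : K - 1], K, rng)  # first (K - 1) left singular vectors
    return labels, selected


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = np.repeat([0, 1], 60)
    X = rng.standard_normal((120, 400))
    X[:, :30] += 3.0 * (2 * truth[:, None] - 1)  # 30 informative features (toy data)
    labels, selected = if_pca(X, K=2)
    print(len(selected), "features kept")
```

The toy data at the bottom is synthetic and chosen only so the sketch has something to run on; it says nothing about the microarray results. Note that no tuning parameter is passed to `if_pca` beyond the number of classes K, which mirrors the tuning-free claim in the text.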

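Experimental Results

Research Questions

RQ1: In high-dimensional clustering with p ≫ n, how can the informative features be identified effectively?
RQ2: What theoretical guarantees of clustering consistency hold in high-dimensional settings with weak signals?
RQ3: On real gene microarray data, how do the error rates of IF-PCA compare with those of existing methods?
RQ4: What phase transition phenomenon arises in clustering, sparse PCA, and low-rank matrix recovery, and how are these problems related?
RQ5: Does the empirical null phenomenon reported in [16] show up again on the microarray data studied here?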

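Key Findings

IF-PCA achieves clustering consistency in a broad context; the proof rests on tight probability bounds for the KS statistics and a post-selection eigen-analysis.
In three of the 10 gene microarray data sets, the error rates of IF-PCA are only 29% or less of the error rates of other methods.
The method is tuning-free: feature selection relies only on KS scores, with the threshold set by Higher Criticism.
A phase transition phenomenon is identified for clustering, sparse PCA, and low-rank matrix recovery, and the range of interest for each problem is delineated.
The study rediscovers, on microarray data, the empirical null phenomenon reported in [16].
The consistency theory covers the p ≫ n regime, and across the 10 microarray data sets IF-PCA shows competitive clustering performance.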
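
Comparing error rates across clustering methods requires a convention for matching estimated cluster labels to the true classes, since cluster labels are only defined up to relabeling. A common convention, assumed here rather than quoted from the paper, is to take the smallest misclassification fraction over all permutations of the K labels. The function name `clustering_error` is hypothetical, and labels are assumed to be integers in 0, ..., K − 1.

```python
from itertools import permutations

import numpy as np


def clustering_error(truth, estimate, K):
    """Fraction of mislabeled samples, minimized over relabelings of the clusters."""
    truth = np.asarray(truth)
    estimate = np.asarray(estimate)
    best = 1.0
    for perm in permutations(range(K)):
        relabeled = np.asarray(perm)[estimate]  # rename cluster k to perm[k]
        best = min(best, float(np.mean(relabeled != truth)))
    return best
```

Under this convention, a statement such as "29% or less of the error rates of other methods" is a ratio of two such error rates: a method with error 0.05 compared against one with error 0.20 (made-up numbers, for illustration only) would have a ratio of 0.25. The brute-force search over permutations is only practical for small K.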


This review was generated by AI and reviewed by a human editor.