QUICK REVIEW

[Paper Review] Selective inference for k-means clustering

Yiqun T. Chen, Daniela Witten|PubMed|Mar 29, 2022

Single-cell and spatial transcriptomics|References: 30|Citations: 23

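One-sentence summary

This paper develops a finite-sample selective-inference p-value for testing the difference in means between two clusters identified by k-means, and guarantees selective Type I error control without data splitting.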

ABSTRACT

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data.

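Motivation and objectives

Motivate testing for a difference in means between clusters defined by data-driven clustering.
Address the inflated Type I error of cluster-based hypothesis tests.
Develop a finite-sample selective inference framework for k-means clustering.
Provide exact p-value computation conditional on the clustering outcome.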

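Proposed method

Formulate the null hypothesis of no difference in means between two clusters estimated by k-means as H0: μ^T ν = 0.
Develop a selective p-value, p_selective, that conditions on the entire sequence of cluster assignments produced by the k-means algorithm.
Show that p_selective equals the survival function of a scaled χ_q random variable truncated to a set S_T.
Provide extensions to non-spherical covariance via whitening or a known Σ, together with the adjusted p-value p_{Σ,selective}.
Handle an unknown variance parameter σ by plugging in a consistent estimator, and provide the corresponding adjusted p-value.
Implement the method in the R package KmeansInference and provide reproducible code.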
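To make the truncated survival function concrete, here is a minimal Python sketch of the final step only. It assumes that S_T has already been computed and is available as a list of disjoint intervals, and that the scale of the χ_q variable is σ‖ν‖₂; both are assumptions made for illustration, and the hard part (characterizing S_T from the k-means iterations) is not shown. The function names `chi_sf` and `truncated_chi_pvalue` are hypothetical and are not part of the KmeansInference package.

```python
import math


def chi_sf(t, q):
    """P(chi_q >= t) for integer q >= 1.

    Uses Q(a + 1, x) = Q(a, x) + x^a e^{-x} / Gamma(a + 1) with x = t^2 / 2,
    starting from Q(1/2, x) = erfc(sqrt(x)) or Q(1, x) = exp(-x).
    """
    if t <= 0:
        return 1.0
    if math.isinf(t):
        return 0.0
    x = 0.5 * t * t
    if q % 2 == 1:
        a, val = 0.5, math.erfc(math.sqrt(x))
    else:
        a, val = 1.0, math.exp(-x)
    while a < q / 2:
        val += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return val


def truncated_chi_pvalue(stat, q, scale, intervals):
    """P(phi >= stat | phi in S) where phi = scale * chi_q.

    `intervals` is a list of disjoint (lo, hi) pairs whose union is S
    (assumed to be supplied by the caller); hi may be math.inf.
    """
    num = 0.0
    den = 0.0
    for lo, hi in intervals:
        upper = chi_sf(hi / scale, q)
        den += chi_sf(lo / scale, q) - upper
        tail_lo = max(lo, stat)  # part of this interval at or beyond stat
        if tail_lo < hi:
            num += chi_sf(tail_lo / scale, q) - upper
    if den <= 0.0:
        raise ValueError("truncation set has zero probability")
    return num / den


if __name__ == "__main__":
    # Without truncation the result is the ordinary chi_q tail probability.
    print(truncated_chi_pvalue(2.0, 2, 1.0, [(0.0, math.inf)]))
    # Restricting to a set that excludes small values makes the same
    # statistic look less extreme, so the p-value is larger.
    print(truncated_chi_pvalue(2.0, 2, 1.0, [(1.5, math.inf)]))
```

When the truncation set is the whole half-line, the result reduces to the unconditional tail probability; conditioning on a smaller set changes both the numerator and the normalizing constant.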

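Experimental results

Research questions

RQ1: Can a finite-sample, selective-inference-based p-value be constructed for testing the difference in means between clusters obtained by k-means?
RQ2: Can the selective Type I error be controlled under H0 by conditioning on the entire k-means clustering path?
RQ3: Can the selective p-value be computed efficiently, and can it be extended to non-spherical covariance structures and unknown variance?
RQ4: Is the method practically applicable to real datasets (handwritten digits, single-cell RNA-seq, etc.), and does it yield valid post-clustering inference?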

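Key findings

A naive test that ignores the clustering step leads to an inflated Type I error.
The proposed p_selective controls the selective Type I error at level α.
The p-value can be computed as the truncated survival function of a scaled χ_q variable, which requires characterizing the set S_T.
Extensions to non-spherical covariance via whitening or a known Σ are possible, with an adjusted p-value p_{Σ,selective}.
An unknown σ can be handled with a consistent estimator, giving asymptotic selective Type I error control.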
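The first finding, Type I error inflation of the naive test, can be reproduced with a small simulation. The sketch below is an illustrative setup rather than the paper's experiment: it draws one-dimensional Gaussian noise with no true clusters, splits it with a plain 2-means (Lloyd) loop, and applies a two-sample z-test with known σ = 1 as if the clusters had been fixed in advance. All function names and simulation settings are assumptions chosen for illustration.

```python
import math
import random


def two_means_1d(xs, iters=100):
    """Lloyd's algorithm with k = 2 on 1-D data; returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)  # deterministic initialization at the extremes
    labels = None
    for _ in range(iters):
        new = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_pvalue(xs, labels, sigma=1.0):
    """Two-sided z-test for equal means that ignores how labels were chosen."""
    g0 = [x for x, lab in zip(xs, labels) if lab == 0]
    g1 = [x for x, lab in zip(xs, labels) if lab == 1]
    diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    return math.erfc(abs(diff) / se / math.sqrt(2.0))


def naive_rejection_rate(reps=500, n=30, alpha=0.05, seed=0):
    """Fraction of null datasets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        labels = two_means_1d(xs)
        if naive_pvalue(xs, labels) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_rejection_rate())
```

Because the clusters are chosen to separate the data, the naive statistic is large by construction, so the rejection rate lands far above the nominal level α even though every dataset is pure noise.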


This review was generated by AI and checked by a human editor.