QUICK REVIEW

[Paper Review] Semi-supervised multiple testing

David Mary, Étienne Roquain|arXiv (Cornell University)|Jun 25, 2021

Statistical Methods in Clinical Trials|References: 58|Citations: 1

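One-line summary

This paper proposes a semi-supervised multiple testing framework that controls the false discovery rate (FDR) without knowledge of the null distribution, using empirical p-values computed from a null training sample (NTS). It establishes upper and lower bounds on the FDR of the Benjamini-Hochberg (BH) procedure applied to these empirical p-values, and shows that the power lost by not knowing the null distribution is small when the NTS size satisfies n ≳ m / (α max(1, k)), where m is the number of tests and k is the number of detectable alternatives.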

ABSTRACT

An important limitation of standard multiple testing procedures is that the null distribution should be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on the false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical $p$-values. These bounds match when $\alpha (n+1)/m$ is an integer, where $n$ is the NTS sample size and $m$ is the number of tests. Second, we give a power analysis for that procedure suggesting that the price to pay for ignoring the null distribution is low when $n$ is sufficiently large in front of $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of "detectable" alternatives. Third, to complete the picture, we also present a negative result that evidences an intrinsic transition phase to the general semi-supervised multiple testing problem and shows that the empirical BH method is optimal in the sense that its performance boundary follows this transition phase. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of correct order without further tuning any constant. Finally, we demonstrate that our work provides a theoretical ground for standard practice in astronomical data analysis, and in particular for the procedure proposed in Mary et al. (2020) for galaxy detection.

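Research motivation and objectives

Remove the limitation that standard multiple testing procedures require the null distribution to be known in advance.
Develop a null distribution-free FDR-controlling method for the semi-supervised setting, in which the null distribution is unknown but a sample drawn from it (the NTS) is available.
Analyze theoretically the performance of the BH procedure applied to empirical p-values derived from the NTS.
Establish conditions under which the loss caused by the unknown null distribution is small and the power is close to that of the oracle procedure.
Demonstrate the optimality and the intrinsic limits of the approach through a phase transition analysis.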

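Proposed method

Compute an empirical p-value for each test using a null training sample (NTS) of size n drawn from the unknown null distribution.
Apply the Benjamini-Hochberg (BH) procedure to the resulting empirical p-values to control the false discovery rate (FDR).
Derive upper and lower bounds on the FDR of the empirical BH procedure and show that they match when α(n+1)/m is an integer.
Carry out a power analysis that quantifies the trade-off between the NTS size n and the number of detectable alternatives k.
Identify a phase transition along the boundary n ≍ m / max(1, k) (that is, n ≍ m when at most one alternative is detectable), and show that below this boundary no procedure can control the FDR with oracle-like power when the null distribution is unknown.
Validate the theoretical results with numerical experiments, confirming that the scaling law is of the correct order without tuning any constant.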
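
The first two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes larger test statistics indicate stronger evidence against the null, and it assumes the usual empirical p-value p_i = (1 + #{j : Y_j ≥ X_i}) / (n + 1), where Y_1, ..., Y_n is the NTS. The function names `empirical_p_values` and `benjamini_hochberg` are hypothetical, and the Gaussian data in the demo are simulated only for illustration.

```python
import numpy as np


def empirical_p_values(test_stats, null_sample):
    """Empirical p-values from a null training sample (NTS).

    Assumed definition: p_i = (1 + #{j : Y_j >= X_i}) / (n + 1).
    """
    null_sorted = np.sort(np.asarray(null_sample, dtype=float))
    n = null_sorted.size
    x = np.asarray(test_stats, dtype=float)
    # searchsorted(side="left") counts NTS values strictly below x_i,
    # so n minus that count is the number of NTS values >= x_i.
    n_greater_equal = n - np.searchsorted(null_sorted, x, side="left")
    return (1.0 + n_greater_equal) / (n + 1.0)


def benjamini_hochberg(p_values, alpha):
    """Standard BH step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Largest k such that p_(k) <= alpha * k / m.
        k_hat = np.max(np.nonzero(below)[0]) + 1
        rejected[order[:k_hat]] = True
    return rejected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nts = rng.standard_normal(10_000)            # simulated NTS (assumption)
    stats = np.concatenate([
        rng.standard_normal(950),                # true nulls
        rng.standard_normal(50) + 4.0,           # shifted alternatives
    ])
    mask = benjamini_hochberg(empirical_p_values(stats, nts), alpha=0.1)
    print("number of rejections:", int(mask.sum()))
```

Note that the smallest attainable p-value here is 1/(n + 1), which is why the NTS size n has to be large relative to m for the BH thresholds αk/m to be reachable at all.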

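Experimental results

Research questions

RQ1: Can the FDR be controlled when the null distribution is unknown but a sample drawn from it is available?
RQ2: How does the size n of the null training sample affect the FDR and the power of the empirical BH procedure?
RQ3: What is the theoretical boundary at which the empirical BH procedure is optimal in terms of both FDR control and power?
RQ4: Does the semi-supervised multiple testing problem have an intrinsic phase transition that limits performance when the NTS is too small?
RQ5: Can the framework provide a theoretical basis for practical procedures in fields such as galaxy detection (e.g., Mary et al., 2020)?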

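Key findings

The FDR of the empirical BH procedure is bounded above and below, and the two bounds match when α(n+1)/m is an integer.
When n ≳ m / (α max(1, k)), the power of the empirical BH procedure approaches that of the oracle BH procedure, where k is the number of detectable alternatives.
There is an intrinsic phase transition along the boundary n ≍ m / max(1, k) (that is, n ≍ m when at most one alternative is detectable): below it, no procedure can control the FDR with oracle-like power when the null distribution is unknown.
The empirical BH procedure is optimal in the sense that its performance boundary coincides with the phase transition threshold.
Numerical experiments confirm that the derived scaling law n ≳ m / (α max(1, k)) is of the correct order, with no additional tuning of constants.
The framework provides a theoretical basis for the galaxy detection procedure proposed by Mary et al. (2020), and thereby for standard practice in astronomical data analysis.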
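
As a rough aid for reading the two conditions above, the sketch below checks whether α(n+1)/m is an integer and computes the smallest n for which the minimum empirical p-value, 1/(n + 1), can reach the BH threshold α max(1, k)/m. This is only a resolution heuristic with the constant set to 1, which is an assumption of this sketch and not a theorem from the paper; the helper names `bounds_match` and `min_nts_size` are hypothetical. Exact rational arithmetic is used to avoid floating-point errors in the integrality check, so α is passed as a string or a `Fraction`.

```python
import math
from fractions import Fraction


def bounds_match(n, m, alpha):
    """True if alpha * (n + 1) / m is an integer (the case where the
    FDR upper and lower bounds are reported to coincide)."""
    a = Fraction(str(alpha))
    return (a * (n + 1) / m).denominator == 1


def min_nts_size(m, alpha, k):
    """Smallest n with 1 / (n + 1) <= alpha * max(1, k) / m.

    Heuristic only: it mirrors the scaling n >~ m / (alpha * max(1, k))
    with the constant taken to be 1 (an assumption).
    """
    a = Fraction(str(alpha))
    return math.ceil(Fraction(m) / (a * max(1, k))) - 1


if __name__ == "__main__":
    m, alpha = 1000, "0.1"
    for k in (0, 1, 10, 100):
        n = min_nts_size(m, alpha, k)
        print(k, n, bounds_match(n, m, alpha))
```

Under this heuristic, more detectable alternatives allow a smaller NTS, which is the qualitative trade-off between n and k described in the findings.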

This review was generated by AI and checked by a human editor.