QUICK REVIEW

[Paper Review] Comment on "Detecting Novel Associations In Large Data Sets" by Reshef Et Al, Science Dec 16, 2011

Noah Simon, Robert Tibshirani|arXiv (Cornell University)|Jan 29, 2014


TL;DR

This paper critiques the maximal information coefficient (MIC) proposed by Reshef et al. (2011) for detecting non-linear associations in large datasets. Simulations across several dependency types and noise levels show that MIC has lower statistical power than distance correlation (dcor) in every case except a high-frequency sine wave, and that it is sometimes less powerful than Pearson correlation, most notably in the linear case. The authors argue that this low power undercuts the practical value of MIC's claimed equitability in exploratory data analysis.

ABSTRACT

The proposal of Reshef et al. (2011) is an interesting new approach for discovering non-linear dependencies among pairs of measurements in exploratory data mining. However, it has a potentially serious drawback. The authors laud the fact that MIC has no preference for some alternatives over others, but as the authors know, there is no free lunch in Statistics: tests which strive to have high power against all alternatives can have low power in many important situations. To investigate this, we ran simulations to compare the power of MIC to that of standard Pearson correlation and distance correlation (dcor). We simulated pairs of variables with different relationships (most of which were considered by Reshef et al.), but with varying levels of noise added. To determine proper cutoffs for testing the independence hypothesis, we simulated independent data with the appropriate marginals. As one can see from the Figure, MIC has lower power than dcor, in every case except the somewhat pathological high-frequency sine wave. MIC is sometimes less powerful than Pearson correlation as well, the linear case being particularly worrisome.

Motivation & Objective

To evaluate the statistical power of MIC, a method proposed for detecting non-linear associations in large datasets.
To investigate whether MIC's claimed equitability comes at the cost of low statistical power.
To compare MIC’s performance against established methods like Pearson correlation and distance correlation (dcor) under controlled simulation conditions.
To assess the reliability of MIC in large-scale exploratory data mining where false positives could be problematic.

Proposed method

Simulated 500 independent datasets for each noise level and dependency type to estimate statistical power.
Took most of the simulated relationships from those examined by Reshef et al., but added varying levels of noise.
Tested independence with MIC, Pearson correlation, and dcor, setting each method's rejection cutoff from simulations of independent data with the appropriate marginals.
Applied the same significance threshold across all methods to ensure consistency in Type I error control.
Evaluated power across eight different dependency structures, including linear, quadratic, and high-frequency sine waves.
Used R to implement the full simulation pipeline, with code publicly available for reproducibility.
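
The procedure listed above can be sketched in a few lines. The following is an illustrative Python re-sketch, not the authors' R code: it uses only the standard library, covers Pearson correlation and distance correlation, and omits MIC because that requires a third-party implementation. The function names, the uniform distribution for x, the Gaussian noise, the sample size, the number of simulations, and the way the null data are generated (redrawing x independently of y) are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(r) / n for r in d]  # equals the column means (d is symmetric)
    grand = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def dcor(x, y):
    """Sample distance correlation (V-statistic form) for scalar data."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)

    def mean_prod(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    denom = math.sqrt(mean_prod(a, a) * mean_prod(b, b))
    if denom == 0:
        return 0.0
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(mean_prod(a, b), 0.0) / denom)


def estimate_power(stat, make_y, n=50, n_sim=100, noise=1.0, alpha=0.05, seed=0):
    """Fraction of dependent datasets whose |stat| exceeds a null-simulated cutoff."""
    rng = random.Random(seed)
    null_vals, alt_vals = [], []
    for _ in range(n_sim):
        x = [rng.random() for _ in range(n)]
        y = [make_y(xi) + noise * rng.gauss(0.0, 1.0) for xi in x]
        alt_vals.append(abs(stat(x, y)))
        # Null case (assumed scheme): redraw x so the marginals match
        # but x and y are independent.
        x_indep = [rng.random() for _ in range(n)]
        null_vals.append(abs(stat(x_indep, y)))
    null_vals.sort()
    # Approximate empirical (1 - alpha) quantile of the null statistics.
    cutoff = null_vals[min(n_sim - 1, int((1 - alpha) * n_sim))]
    return sum(v > cutoff for v in alt_vals) / n_sim


if __name__ == "__main__":
    relationships = {
        "linear": lambda x: x,
        "quadratic": lambda x: 4 * (x - 0.5) ** 2,
    }
    for name, f in relationships.items():
        for label, stat in (("pearson", pearson), ("dcor", dcor)):
            p = estimate_power(stat, f, n=50, n_sim=60, noise=0.3)
            print(f"{name:9s} {label:7s} power = {p:.2f}")
```

Raising `noise` and re-running traces out a power curve per method and relationship, which is the kind of comparison the review describes. The cutoff here is a simple order statistic of the null values, so with few simulations the realized Type I error is only approximately `alpha`.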

Experimental results

Research questions

RQ1: Does MIC maintain high statistical power across diverse non-linear relationships, especially under increasing noise?
RQ2: How does MIC's power compare to Pearson correlation and distance correlation in detecting linear and non-linear dependencies?
RQ3: Is MIC's equitability property undermined by its low statistical power in practical settings?
RQ4: Can MIC produce an unacceptably high rate of false positives in large-scale data mining due to low power?
RQ5: Is distance correlation a more robust and powerful alternative to MIC for general-purpose association detection?

Key findings

MIC demonstrated lower statistical power than distance correlation (dcor) in every simulated dependency type except the high-frequency sine wave.
In the linear relationship case, MIC was less powerful than Pearson correlation, which is particularly concerning given that MIC is intended to generalize beyond linearity.
The power advantage of dcor over MIC held across noise levels for every dependency structure except the high-frequency sine wave.
MIC's low power suggests it may yield an unacceptably high rate of false positives in large-scale exploratory data analysis.
The authors conclude that dcor is a more powerful, computationally simple, and reliable alternative to MIC for detecting associations in large datasets.
The simulation results indicate that MIC's equitability does not compensate for its poor statistical power, limiting its practical utility.
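
The link between low power and false positives in the findings above can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes a screening setting in which some fraction of the tested pairs are truly dependent and every test is run at the same level, then computes the expected share of flagged pairs that are false. The function name and all numeric values are hypothetical.

```python
def false_share_among_flagged(alpha, power, frac_dependent):
    """Expected false flags divided by expected total flags.

    alpha:          Type I error rate of each test
    power:          probability of flagging a truly dependent pair
    frac_dependent: assumed fraction of tested pairs that are truly dependent
    """
    false_hits = alpha * (1.0 - frac_dependent)
    true_hits = power * frac_dependent
    total = false_hits + true_hits
    return false_hits / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical setting: 10% of pairs are truly dependent, tests at the 5% level.
    for power in (0.8, 0.3):
        share = false_share_among_flagged(alpha=0.05, power=power, frac_dependent=0.1)
        print(f"power={power:.1f}: {share:.0%} of flagged pairs expected to be false")
```

Under these assumed numbers, dropping power from 0.8 to 0.3 raises the expected false share from about 36% to about 60%, even though the Type I error rate is unchanged. This is one way to read the claim that a low-power statistic yields too many false positives in large-scale screening.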

This review was created by AI and reviewed by human editors.