QUICK REVIEW

[Paper Review] The xyz algorithm for fast interaction search in high-dimensional data

Gian-Andrea Thanei, Nicolai Meinshausen|arXiv (Cornell University)|Oct 17, 2016

Gene expression and cancer classification | 27 references | 7 citations

TL;DR

The xyz algorithm is a randomized, subquadratic-time method for fast interaction search in high-dimensional data, transforming interaction detection into a closest-pair problem via random projections. It enables near-linear time discovery of strong interactions and O(p^α) scaling for weaker ones, achieving screening of over 10^11 interactions in under 280 seconds on a single-core CPU, with theoretical guarantees and an R implementation available on CRAN and GitHub.

ABSTRACT

When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $\mathcal{O}(p^\alpha)$ operations for $1 < \alpha < 2$ depending on their strength. The underlying idea is to transform interaction search into a closest-pair problem which can be solved efficiently in subquadratic time. The algorithm is called $\mathit{xyz}$ and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than $10^{11}$ interactions can be screened in under $280$ seconds with a single-core $1.2$ GHz CPU.

Motivation & Objective

Address the computational infeasibility of exhaustive pairwise interaction search in high-dimensional data, especially when p is large.
Overcome the O(p²) complexity of naive interaction screening, which becomes prohibitive for large p.
Develop a method that can efficiently detect strong and weak interactions with subquadratic runtime scaling.
Provide theoretical guarantees on interaction recovery under mild moment and tail conditions.
Enable practical application to large-scale problems such as genome-wide association studies (GWAS) with massive interaction spaces.

Proposed method

Transform interaction search into a closest-pair problem by multiplying each predictor elementwise with the response, $Z_{ij} = Y_i X_{ij}$, so that a strong interaction between variables $j$ and $k$ appears as a small distance $\|X_j - Z_k\|^2 < \kappa'$.
Apply random projections to reduce each of the 2p vectors (X and Z) to one dimension, enabling efficient sorting in O(p log p) time.
Leverage the fact that random projections preserve relative distances with high probability, allowing subquadratic runtime via sorting-based nearest-neighbor approximation.
Formulate the method as a locality-sensitive hashing (LSH) scheme optimized for interaction detection, with theoretical bounds on false positive and false negative rates.
Integrate the xyz algorithm into a Lasso-based framework to fit models with all main effects and pairwise interactions at subquadratic cost.
Implement the core algorithm and its Lasso extension in the R package 'xyz', available on CRAN and GitHub for reproducible research.
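
The steps above can be illustrated with a small sketch. It is not the code of the R package 'xyz'; it is a pure-Python toy under stated assumptions: all entries of X and Y are coded as -1/+1, the response is a noise-free product of two columns, and the names `simulate`, `xyz_candidates`, `n_runs`, and `subsample_size` are hypothetical. Each run projects the columns of X and of Z (with Z_ij = Y_i X_ij) onto one random direction over a random subsample of rows, sorts the 2p projected values, and reports pairs whose projections coincide.

```python
import random
from collections import Counter


def simulate(n=200, p=30, pair=(3, 7), seed=1):
    """Toy data (assumption): i.i.d. +/-1 predictors, Y = product of two columns."""
    rng = random.Random(seed)
    X = [[rng.choice((-1, 1)) for _ in range(p)] for _ in range(n)]
    Y = [row[pair[0]] * row[pair[1]] for row in X]
    return X, Y


def xyz_candidates(X, Y, n_runs=20, subsample_size=12, seed=0):
    """Count, for each pair (j, k), in how many runs X_j and Z_k project to the same value."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    hits = Counter()
    for _ in range(n_runs):
        rows = rng.sample(range(n), subsample_size)   # random subsample of observations
        weights = [rng.random() for _ in rows]        # one random projection direction
        projected = []
        for j in range(p):
            x_proj = sum(w * X[i][j] for w, i in zip(weights, rows))
            z_proj = sum(w * (Y[i] * X[i][j]) for w, i in zip(weights, rows))
            projected.append((x_proj, 0, j))          # tag 0: column of X
            projected.append((z_proj, 1, j))          # tag 1: column of Z
        projected.sort()                              # O(p log p) per run
        found = set()
        start = 0
        while start < len(projected):
            # Scan one run of identical projected values.
            end = start
            while end < len(projected) and projected[end][0] == projected[start][0]:
                end += 1
            xs = [j for _, tag, j in projected[start:end] if tag == 0]
            zs = [k for _, tag, k in projected[start:end] if tag == 1]
            for j in xs:
                for k in zs:
                    if j != k:
                        found.add((min(j, k), max(j, k)))
            start = end
        hits.update(found)                            # each pair counted once per run
    return hits


X, Y = simulate()
hits = xyz_candidates(X, Y)
print(hits.most_common(3))
```

In this toy setting the planted pair collides in every run, while an unrelated pair collides only when its columns happen to agree on all sampled rows, which becomes rarer as `subsample_size` grows. With a noisy response, the planted pair would also miss some runs, so the subsample size trades detection power against the number of false candidates that must be checked afterwards.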

Experimental results

Research questions

RQ1: Can interaction search in high-dimensional data be performed in subquadratic time in p while maintaining high detection power?
RQ2: To what extent can random projections reduce the complexity of interaction detection without sacrificing accuracy?
RQ3: How does the algorithm's runtime scale with interaction strength, and can it achieve near-linear time for strong interactions?
RQ4: What are the theoretical guarantees on the probability of correctly identifying true interactions under mild moment and tail assumptions?
RQ5: Can the method be efficiently scaled to real-world problems such as GWAS with p > 10^6 variables and over 10^11 possible interactions?

Key findings

The xyz algorithm achieves a runtime of O(np) for strong interactions when the signal-to-noise ratio is high, approaching linear time in p.
Weaker interactions are detected in O(p^α) time for 1 < α < 2, with α decreasing as interaction strength increases.
The algorithm can screen more than 10^11 pairwise interactions in under 280 seconds using a single-core 1.2 GHz CPU, demonstrating practical scalability.
Theoretical analysis shows that with high probability, the true interaction pair is separated from non-interacting pairs by a margin that grows with sample size n.
The method achieves high detection power even when main effects are masked by interaction effects, outperforming main-effect-first strategies in challenging signal configurations.
The R package 'xyz' provides a fully reproducible implementation of the algorithm and its Lasso extension, supporting large-scale statistical modeling.
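
To put the reported figures in perspective, the arithmetic behind them can be checked with a short sketch. It assumes that "interactions" means unordered pairs of distinct variables, so p variables give p(p-1)/2 candidates; the helper names `n_pairs`, `smallest_p_for`, and `ops` are hypothetical, and the operation counts ignore all constants and the dependence on n.

```python
import math


def n_pairs(p):
    """Number of unordered pairs of distinct variables."""
    return p * (p - 1) // 2


def smallest_p_for(target_pairs):
    """Smallest p whose pair count reaches target_pairs."""
    p = math.isqrt(2 * target_pairs)
    while n_pairs(p) < target_pairs:
        p += 1
    while p > 2 and n_pairs(p - 1) >= target_pairs:
        p -= 1
    return p


def ops(p, alpha):
    """Idealised operation count p**alpha, constants ignored (assumption)."""
    return math.pow(p, alpha)


target = 10**11
p_min = smallest_p_for(target)
pairs_per_second = target / 280   # implied screening rate for the reported run

print(p_min, f"{pairs_per_second:.3g}")
for alpha in (1.0, 1.5, 2.0):
    print(alpha, f"{ops(p_min, alpha):.3g}")
```

Under these assumptions, 10^11 pairs correspond to a few hundred thousand variables, and screening them in 280 seconds implies a rate of several hundred million pairs per second, which is why an exponent below 2 matters at this scale.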


This review was created by AI and reviewed by human editors.