QUICK REVIEW

[Paper Review] The xyz algorithm for fast interaction search in high-dimensional data

Gian-Andrea Thanei, Nicolai Meinshausen|arXiv (Cornell University)|Oct 17, 2016
Gene expression and cancer classification | 27 references | 7 citations
TL;DR

The xyz algorithm is a randomized, subquadratic-time method for fast interaction search in high-dimensional data, transforming interaction detection into a closest-pair problem via random projections. It enables near-linear time discovery of strong interactions and O(p^α) scaling for weaker ones, achieving screening of over 10^11 interactions in under 280 seconds on a single-core CPU, with theoretical guarantees and an R implementation available on CRAN and GitHub.

ABSTRACT

When performing regression on a dataset with $p$ variables, it is often of interest to go beyond using main linear effects and include interactions as products between individual variables. For small-scale problems, these interactions can be computed explicitly but this leads to a computational complexity of at least $\mathcal{O}(p^2)$ if done naively. This cost can be prohibitive if $p$ is very large. We introduce a new randomised algorithm that is able to discover interactions with high probability and under mild conditions has a runtime that is subquadratic in $p$. We show that strong interactions can be discovered in almost linear time, whilst finding weaker interactions requires $\mathcal{O}(p^\alpha)$ operations for $1 < \alpha < 2$ depending on their strength. The underlying idea is to transform interaction search into a closest-pair problem which can be solved efficiently in subquadratic time. The algorithm is called $\mathit{xyz}$ and is implemented in the language R. We demonstrate its efficiency for application to genome-wide association studies, where more than $10^{11}$ interactions can be screened in under $280$ seconds with a single-core $1.2$ GHz CPU.

Motivation & Objective

  • Address the computational infeasibility of exhaustive pairwise interaction search in high-dimensional data, especially when p is large.
  • Overcome the O(p²) complexity of naive interaction screening, which becomes prohibitive for large p.
  • Develop a method that can efficiently detect strong and weak interactions with subquadratic runtime scaling.
  • Provide theoretical guarantees on interaction recovery under mild moment and tail conditions.
  • Enable practical application to large-scale problems such as genome-wide association studies (GWAS) with massive interaction spaces.

Proposed method

  • Transform interaction search into a closest-pair problem by redefining predictors using the response vector, leading to the condition ∥X_j − Z_k∥² < κ′, where Z_ij = Y_i X_ij.
  • Apply random projections to reduce each of the 2p vectors (X and Z) to one dimension, enabling efficient sorting in O(p log p) time.
  • Leverage the fact that random projections preserve relative distances with high probability, allowing subquadratic runtime via sorting-based nearest-neighbor approximation.
  • Formulate the method as a locality-sensitive hashing (LSH) scheme optimized for interaction detection, with theoretical bounds on false positive and false negative rates.
  • Integrate the xyz algorithm into a Lasso-based framework to fit models with all main effects and pairwise interactions at subquadratic cost.
  • Implement the core algorithm and its Lasso extension in the R package 'xyz', available on CRAN and GitHub for reproducible research.
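
The closest-pair reformulation above can be illustrated with a small sketch. It assumes that the response and all predictors are coded as ±1, in which case ∥X_j − Z_k∥² = 2n − 2 Σ_i Y_i X_ij X_ik, so a small distance between X_j and Z_k corresponds to a large inner product between Y and the product X_j X_k. The code below is a simplified illustration in Python, not the authors' R implementation: the function name `xyz_sketch`, its parameters, the Gaussian projection, and the vote-counting step are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def xyz_sketch(X, Y, n_projections=20, seed=0):
    """Rank candidate interaction pairs (j, k) by how often X_j and Z_k
    end up as neighbours after a random one-dimensional projection.

    Hypothetical helper for illustration; assumes X and Y hold +1/-1 values.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = Y[:, None] * X                      # Z_ij = Y_i * X_ij
    votes = Counter()
    for _ in range(n_projections):
        r = rng.standard_normal(n)          # random projection direction
        x_proj = r @ X                      # p projected X columns
        z_proj = r @ Z                      # p projected Z columns
        order = np.argsort(x_proj)          # sorting costs O(p log p)
        sorted_x = x_proj[order]
        pos = np.searchsorted(sorted_x, z_proj)
        for k in range(p):
            best_j, best_d = None, np.inf
            # Only the two sorted neighbours of z_proj[k] are inspected.
            for cand in (pos[k] - 1, pos[k]):
                if 0 <= cand < p:
                    j = int(order[cand])
                    d = abs(sorted_x[cand] - z_proj[k])
                    if j != k and d < best_d:
                        best_j, best_d = j, d
            if best_j is not None:
                votes[(min(best_j, k), max(best_j, k))] += 1
    # Pairs that are close in many projections are the strongest candidates.
    return votes.most_common()
```

With a noise-free response such as `Y = X[:, 3] * X[:, 7]`, the column Z_7 coincides with X_3, so the pair (3, 7) is a neighbour in every projection and rises to the top of the ranking. Noisy data would need more projections and a follow-up check of the candidate pairs on the full data.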

Experimental results

Research questions

  • RQ1: Can interaction search in high-dimensional data be performed in subquadratic time in p while maintaining high detection power?
  • RQ2: To what extent can random projections reduce the complexity of interaction detection without sacrificing accuracy?
  • RQ3: How does the algorithm's runtime scale with interaction strength, and can it achieve near-linear time for strong interactions?
  • RQ4: What are the theoretical guarantees on the probability of correctly identifying true interactions under mild moment and tail assumptions?
  • RQ5: Can the method be efficiently scaled to real-world problems such as GWAS with p > 10^6 variables and over 10^11 possible interactions?

Key findings

  • The xyz algorithm achieves a runtime of O(np) for strong interactions when the signal-to-noise ratio is high, approaching linear time in p.
  • Weaker interactions are detected in O(p^α) time for 1 < α < 2, with α decreasing as interaction strength increases.
  • The algorithm can screen more than 10^11 pairwise interactions in under 280 seconds using a single-core 1.2 GHz CPU, demonstrating practical scalability.
  • Theoretical analysis shows that with high probability, the true interaction pair is separated from non-interacting pairs by a margin that grows with sample size n.
  • The method achieves high detection power even when main effects are masked by interaction effects, outperforming main-effect-first strategies in challenging signal configurations.
  • The R package 'xyz' provides a fully reproducible implementation of the algorithm and its Lasso extension, supporting large-scale statistical modeling.
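
To give a sense of scale for the interaction counts above, the following sketch does the arithmetic only: p variables give p(p − 1)/2 unordered pairs, and the helper finds the smallest p whose pair count exceeds a target. The function names are hypothetical, and the sketch does not reproduce the dimensions of the dataset used in the paper.

```python
import math


def n_pairs(p: int) -> int:
    """Number of unordered pairs among p variables: p(p - 1)/2."""
    return p * (p - 1) // 2


def smallest_p_exceeding(target: int) -> int:
    """Smallest p with n_pairs(p) > target (hypothetical helper)."""
    # Start near the positive root of p(p - 1)/2 = target, then adjust.
    p = (1 + math.isqrt(1 + 8 * target)) // 2
    while n_pairs(p) <= target:
        p += 1
    while p > 2 and n_pairs(p - 1) > target:
        p -= 1
    return p


p_needed = smallest_p_exceeding(10**11)
print(p_needed, n_pairs(p_needed))
```

An exhaustive scan would have to evaluate every one of these pairs, which is the quadratic cost the subquadratic search is designed to avoid.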


This review was created by AI and reviewed by human editors.