QUICK REVIEW

[Paper Review] Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets

Sofia Triantafillou, Ioannis Tsamardinos | arXiv (Cornell University) | Mar 10, 2014
Bayesian Modeling and Causal Inference | 51 references | 93 citations
TL;DR

COmbINE is a constraint-based causal discovery algorithm that integrates multiple heterogeneous datasets measured over overlapping variable sets under different interventions. It converts estimated dependencies and independencies into path constraints, encodes them as a SAT instance, and summarizes the invariant and variant structure of all causal models that fit the data. Statistical conflicts are handled by processing constraints in order of confidence. In the empirical evaluation it is more efficient than the only comparable prior algorithm, and a proof-of-concept applies it to 4 real mass-cytometry datasets.

ABSTRACT

Scientific practice typically involves repeatedly studying a system, each time trying to unravel a different perspective. In each study, the scientist may take measurements under different experimental conditions (interventions, manipulations, perturbations) and measure different sets of quantities (variables). The result is a collection of heterogeneous data sets coming from different data distributions. In this work, we present algorithm COmbINE, which accepts a collection of data sets over overlapping variable sets under different experimental conditions; COmbINE then outputs a summary of all causal models indicating the invariant and variant structural characteristics of all models that simultaneously fit all of the input data sets. COmbINE converts estimated dependencies and independencies in the data into path constraints on the data-generating causal model and encodes them as a SAT instance. The algorithm is sound and complete in the sample limit. To account for conflicting constraints arising from statistical errors, we introduce a general method for sorting constraints in order of confidence, computed as a function of their corresponding p-values. In our empirical evaluation, COmbINE outperforms in terms of efficiency the only pre-existing similar algorithm; the latter additionally admits feedback cycles, but does not admit conflicting constraints which hinders the applicability on real data. As a proof-of-concept, COmbINE is employed to co-analyze 4 real, mass-cytometry data sets measuring phosphorylated protein concentrations of overlapping protein sets under 3 different interventions.

Motivation & Objective

  • To address the challenge of integrating multiple heterogeneous datasets from different experimental conditions with overlapping variables.
  • To develop a method that jointly infers causal structures across datasets while identifying invariant and variant causal characteristics.
  • To handle statistical errors and conflicting constraints in real-world data through confidence-based constraint ranking.
  • To run more efficiently than the only comparable existing algorithm, which cannot handle conflicting constraints.

Proposed method

  • Convert statistical dependencies and independencies from each dataset into path constraints on the underlying causal model.
  • Encode all constraints as a Boolean satisfiability (SAT) problem using a compact representation to improve scalability.
  • Rank constraints by confidence using p-values from statistical independence tests to resolve conflicts.
  • Use Maximal Ancestral Graphs (MAGs) and Semi-Markov Causal Models (SMCMs) to represent and reason about causal structures under interventions.
  • Apply a greedy constraint addition strategy: add constraints in order of decreasing confidence and discard any constraint that conflicts with those already accepted.
  • Leverage modern SAT solvers to efficiently compute all models that simultaneously fit all input datasets.
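
The greedy conflict resolution and the invariant/variant summary listed above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Boolean variables (`edge_AB`, `edge_BC`, `edge_AC`), the constraint names, and the confidence scores are all hypothetical, the confidences are taken as given rather than computed from p-values, and a brute-force enumeration stands in for a real SAT solver.

```python
from itertools import product

# A literal is (variable_name, required_truth_value).
# A clause is a list of literals joined by OR.
# A constraint is (name, confidence, list_of_clauses), clauses joined by AND.

def satisfying_models(variables, clauses):
    """Enumerate every assignment that satisfies all clauses (brute force)."""
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == want for v, want in clause)
               for clause in clauses):
            models.append(assignment)
    return models

def greedy_select(variables, constraints):
    """Add constraints from most to least confident, discarding any
    constraint that makes the accepted set unsatisfiable."""
    accepted, clauses = [], []
    for name, _conf, new_clauses in sorted(constraints, key=lambda c: -c[1]):
        trial = clauses + new_clauses
        if satisfying_models(variables, trial):
            clauses = trial
            accepted.append(name)
    return accepted, clauses

def summarize(variables, models):
    """Mark a variable invariant if it takes one value in every model,
    and variant otherwise."""
    summary = {}
    for v in variables:
        seen = {m[v] for m in models}
        if len(seen) == 1:
            summary[v] = ("invariant", seen.pop())
        else:
            summary[v] = ("variant", None)
    return summary

# Hypothetical Boolean variables: does a given edge exist in the model?
variables = ["edge_AB", "edge_BC", "edge_AC"]

# Hypothetical constraints with made-up confidence scores.
constraints = [
    ("c1", 0.99, [[("edge_AB", True)]]),                      # A-B present
    ("c2", 0.80, [[("edge_AB", False)]]),                     # A-B absent
    ("c3", 0.90, [[("edge_BC", True), ("edge_AC", True)]]),   # B-C or A-C
    ("c4", 0.60, [[("edge_BC", False), ("edge_AC", False)]]), # not both
]

accepted, clauses = greedy_select(variables, constraints)
models = satisfying_models(variables, clauses)
summary = summarize(variables, models)

print("accepted:", accepted)
print("models:", len(models))
for variable, status in summary.items():
    print(variable, status)
```

In this toy instance `c2` contradicts the more confident `c1` and is discarded, while the remaining constraints leave two models that agree on `edge_AB` and differ on the other two edges. A practical version would replace the brute-force check with calls to a SAT solver, since enumeration is exponential in the number of variables.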

Experimental results

Research questions

  • RQ1: Can a unified causal model be learned from multiple datasets with overlapping variables and different interventions?
  • RQ2: How can conflicting constraints arising from statistical errors be resolved during causal discovery?
  • RQ3: What is the impact of sample size and number of datasets on the accuracy and efficiency of causal inference?
  • RQ4: How does COmbINE compare in performance and scalability to existing algorithms that do not handle conflicting constraints?
  • RQ5: To what extent can COmbINE identify invariant and variant causal structures across multiple experimental conditions?

Key findings

  • COmbINE outperforms the only pre-existing similar algorithm in terms of computational efficiency and scalability to larger problem sizes.
  • The algorithm successfully handles conflicting constraints through confidence-based ranking, enabling application to real-world data where statistical errors are common.
  • Empirical evaluation shows that COmbINE maintains high accuracy even with small sample sizes and multiple datasets.
  • The conflict resolution technique in COmbINE significantly outperforms alternative methods in terms of precision and recall of causal features.
  • In a proof-of-concept on 4 real mass-cytometry datasets, COmbINE identified consistent causal patterns across interventions, demonstrating practical utility.

This review was created by AI and reviewed by human editors.