QUICK REVIEW

[Paper Review] High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning

Francis Bach | arXiv.org | Sep 4, 2009

Domain Adaptation and Few-Shot Learning | 82 references | 56 citations

TL;DR

This paper proposes a hierarchical kernel learning framework for high-dimensional non-linear variable selection. It embeds exponentially many basis kernels in a directed acyclic graph (DAG), so that sparsity-inducing kernel selection runs in time polynomial in the number of selected kernels. Under certain assumptions, variable selection remains consistent even when the number of irrelevant variables is exponential in the number of observations, and simulations on synthetic and UCI datasets show state-of-the-art predictive performance for non-linear regression.

ABSTRACT

We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.

Motivation & Objective

Address the challenge of non-linear variable selection in high-dimensional settings where traditional linear methods fail due to complex interactions.
Overcome the computational intractability of selecting from exponentially many non-linear kernels by exploiting a natural hierarchical structure.
Develop a sparsity-inducing regularization framework based on a graph-adapted norm that restricts valid sparsity patterns to those compatible with a DAG.
Establish theoretical consistency of variable selection under high-dimensional asymptotics, allowing the number of irrelevant variables to grow exponentially with sample size.
Demonstrate state-of-the-art predictive performance on non-linear regression tasks through extensive simulations on synthetic and UCI benchmark datasets.

Proposed method

Model non-linear interactions using a sum of positive definite basis kernels indexed by subsets of input variables or multi-dimensional indices in {0,…,q}^p.
Embed the set of basis kernels into a directed acyclic graph (DAG) to exploit hierarchical relationships among variable interactions.
Introduce a graph-adapted sparsity-inducing norm, built as a weighted sum of ℓ2-norms that each cover a node of the DAG together with its descendants, to control kernel selection.
Formulate the optimization problem as a multiple kernel learning task with a regularization term that promotes sparse selection over the DAG-structured kernel space.
Design a polynomial-time algorithm for kernel selection by leveraging the DAG structure to avoid brute-force enumeration of all possible kernel combinations.
Use representer theorems and Hilbertian regularization to work in implicit feature spaces while maintaining computational tractability.
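
As a rough illustration of the graph-adapted norm in the list above, the sketch below assumes the norm takes the form Ω(β) = Σ_v d_v ‖β_{D(v)}‖₂, where D(v) is vertex v together with its descendants. It uses the power set of p variables as the DAG, so the descendants of a subset are its supersets. The helper names (`all_subsets`, `descendants`, `dag_norm`), the uniform weights d_v = 1, and the toy two-dimensional blocks β_v are assumptions made for illustration, not the paper's implementation.

```python
from itertools import combinations
import numpy as np

def all_subsets(p):
    """All subsets of {0, ..., p-1} as frozensets (the vertices of the DAG)."""
    return [frozenset(c) for r in range(p + 1) for c in combinations(range(p), r)]

def descendants(v, vertices):
    """Descendants of v in the power-set DAG: every superset of v, v included."""
    return [w for w in vertices if v <= w]

def dag_norm(beta, vertices, weights=None):
    """Sum over vertices v of d_v * ||beta restricted to descendants of v||_2."""
    total = 0.0
    for v in vertices:
        d_v = 1.0 if weights is None else weights[v]  # assumed uniform weights
        sq = sum(float(np.sum(beta[w] ** 2)) for w in descendants(v, vertices))
        total += d_v * np.sqrt(sq)
    return total

p = 3
vertices = all_subsets(p)

# Toy coefficient blocks: only the kernels for {0} and {0, 1} are active.
beta = {v: np.zeros(2) for v in vertices}
beta[frozenset({0})] = np.array([3.0, 4.0])      # block norm 5
beta[frozenset({0, 1})] = np.array([0.0, 12.0])  # block norm 12

omega = dag_norm(beta, vertices)
print(omega)
```

Each block is charged once per ancestor, so deep vertices are penalized more heavily than shallow ones. With a norm of this shape, setting one group to zero removes a vertex and all of its descendants, which is one way to see why the selected kernels would respect the DAG.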

Experimental results

Research questions

RQ1: Can we efficiently perform non-linear variable selection in high-dimensional settings where the number of potential interactions is exponential in the input dimension?
RQ2: How can we structure the space of non-linear kernels to enable polynomial-time selection while preserving statistical consistency?
RQ3: What regularization framework allows consistent variable selection when the number of irrelevant variables grows exponentially with the number of observations?
RQ4: Can a DAG-based kernel embedding improve both computational efficiency and predictive performance compared to standard multiple kernel learning?
RQ5: To what extent does the proposed method adapt to complex, high-order interactions without overfitting in high-dimensional regimes?

Key findings

The proposed method selects among an exponential number of basis kernels in time polynomial in the number of selected kernels by exploiting the DAG structure, avoiding intractable enumeration.
Theoretical analysis shows that under appropriate assumptions, the method achieves consistent variable selection even when the number of irrelevant variables is exponential in the number of observations.
The framework allows for the selection of non-linear interactions up to order p, that is, among all subsets of the p input variables, which is necessary for universal consistency.
Empirical results on synthetic and UCI datasets demonstrate state-of-the-art predictive performance for non-linear regression tasks.
The method achieves strong generalization by combining sparsity-inducing regularization with kernel-based learning in high-dimensional implicit feature spaces.
Theoretical bounds on estimation error and eigenvalue stability are derived, showing that the method remains robust under model misspecification and finite-sample effects.
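
The first finding depends on never enumerating the exponentially many basis kernels. One identity that makes this plausible, sketched below under the assumption that each basis kernel is a product of one-dimensional kernels over a subset of variables, is that the sum over all 2^p subsets collapses to a product of p factors: Σ_J Π_{i∈J} k_i = Π_i (1 + k_i). The Gaussian one-dimensional kernel, its `width` parameter, and the function names are illustrative choices rather than details taken from the paper.

```python
from itertools import combinations
import numpy as np

def gauss_1d(a, b, width=1.0):
    """One-dimensional Gaussian kernel matrix between vectors a and b."""
    return np.exp(-width * (a[:, None] - b[None, :]) ** 2)

def sum_over_subsets(X):
    """Brute force: add the product kernel of every subset of variables (2^p terms)."""
    n, p = X.shape
    K1 = [gauss_1d(X[:, i], X[:, i]) for i in range(p)]
    total = np.zeros((n, n))
    for r in range(p + 1):
        for J in combinations(range(p), r):
            term = np.ones((n, n))  # the empty subset contributes the constant kernel
            for i in J:
                term *= K1[i]
            total += term
    return total

def product_form(X):
    """Closed form: the same sum written as a product of p factors (1 + k_i)."""
    n, p = X.shape
    K = np.ones((n, n))
    for i in range(p):
        K *= 1.0 + gauss_1d(X[:, i], X[:, i])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 observations, 4 variables
K_brute = sum_over_subsets(X)  # 2^4 = 16 terms
K_fast = product_form(X)       # 4 factors
print(np.allclose(K_brute, K_fast))
```

The brute-force version costs a number of kernel products that doubles with every added variable, while the product form grows linearly in p. This only shows that the full sum is cheap to evaluate; selecting a sparse subset of kernels still needs the DAG-structured norm and search described in the review.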

This review was created by AI and reviewed by human editors.