QUICK REVIEW

[Paper Review] Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Francis Bach | arXiv.org | Sep 9, 2008

Domain Adaptation and Few-Shot Learning | 21 references | 175 citations

TL;DR

This paper proposes a hierarchical multiple kernel learning framework that enables efficient sparsity-inducing regularization in large, structured feature spaces by leveraging a directed acyclic graph (DAG) to organize basis kernels. It achieves polynomial-time computation and demonstrates state-of-the-art predictive performance on synthetic and UCI datasets, particularly for nonlinear variable selection.

ABSTRACT

For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

Motivation & Objective

Address the challenge of performing efficient kernel selection in large, potentially infinite-dimensional feature spaces where the number of basis kernels is exponential in input dimension.
Overcome the computational intractability of direct multiple kernel learning in such large spaces by exploiting a hierarchical structure via a directed acyclic graph (DAG).
Introduce a sparsity-inducing regularization framework using block ℓ¹-norms within a DAG-structured kernel decomposition to enable automatic selection of relevant feature subspaces.
Establish theoretical consistency conditions for model selection under the proposed framework, showing it consistently estimates the hull of relevant variables.
Demonstrate empirically that the method achieves superior predictive performance compared to standard ℓ²-regularization and baseline multiple kernel learning on both synthetic and real-world datasets.

Proposed method

Decompose a positive definite kernel as a sum of basis kernels, each associated with a node in a directed acyclic graph (DAG), enabling hierarchical structure in the feature space.
Apply a block ℓ¹-norm regularization over groups of basis kernels, where each group consists of a DAG node together with all of its descendants, to induce sparsity at the group level.
Design an optimization algorithm that exploits the DAG structure to perform kernel selection in polynomial time relative to the number of selected kernels, avoiding exponential complexity.
Use a representer theorem to express the predictor function in terms of the kernel expansion, allowing the optimization to be solved in the dual space with structured sparsity.
Formulate the optimization problem as a convex program that enforces hierarchical sparsity patterns: a kernel can be selected only if all of its ancestors in the DAG are also selected.
Leverage the dual norm of the group-structured regularization to derive consistency conditions, using the structure of the DAG to bound the dual norm and assess model selection reliability.
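
The hierarchical penalty described above can be illustrated with a small sketch. It is not the paper's implementation. It assumes a simplified setting in which the DAG is the power set of p input variables ordered by inclusion, each node carries a single scalar weight instead of a function in a reproducing kernel Hilbert space, and each group is a node plus all of its descendants. The function names and the uniform weights d_v = 1 are hypothetical choices made for illustration.

```python
from itertools import combinations
import math


def power_set_nodes(p):
    """All subsets of {0, ..., p-1} as frozensets; each subset is one DAG node."""
    return [frozenset(c) for r in range(p + 1) for c in combinations(range(p), r)]


def descendants(v, nodes):
    """v together with every node that contains v (its supersets)."""
    return [w for w in nodes if v <= w]


def ancestors(v, nodes):
    """v together with every node contained in v (its subsets)."""
    return [w for w in nodes if w <= v]


def hull(selected, nodes):
    """Smallest ancestor-closed set of nodes that contains `selected`."""
    closed = set()
    for v in selected:
        closed.update(ancestors(v, nodes))
    return closed


def structured_norm(beta, nodes, d=None):
    """Sum over nodes v of d_v * ||beta restricted to descendants(v)||_2.

    beta maps a node to a scalar weight (missing nodes count as zero).
    d maps a node to a positive weight; defaults to 1 for every node (an assumption).
    """
    total = 0.0
    for v in nodes:
        d_v = 1.0 if d is None else d[v]
        block = [beta.get(w, 0.0) for w in descendants(v, nodes)]
        total += d_v * math.sqrt(sum(b * b for b in block))
    return total


nodes = power_set_nodes(3)
pair = frozenset({0, 1})
print(sorted(sorted(v) for v in hull({pair}, nodes)))
print(structured_norm({pair: 3.0}, nodes))
```

In this toy setting, a nonzero weight on the interaction node {0, 1} is charged once for every one of its ancestors, which is why the penalty favors patterns where a node stays at zero unless its ancestors are already active.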

Experimental results

Research questions

RQ1: Can sparsity-inducing regularization (e.g., ℓ¹ or block ℓ¹) be effectively applied within large, structured feature spaces defined by kernel decompositions?
RQ2: Is it possible to perform kernel selection in polynomial time when the number of basis kernels is exponential in input dimension, provided a DAG structure is available?
RQ3: Does the proposed hierarchical multiple kernel learning framework lead to improved predictive performance compared to standard ℓ²-regularization and non-hierarchical multiple kernel learning?
RQ4: What are the necessary and sufficient conditions for model consistency in the proposed framework, particularly regarding the selection of relevant feature subspaces?
RQ5: Can the framework be used effectively for nonlinear variable selection, especially in high-dimensional settings with complex feature interactions?

Key findings

The proposed hierarchical multiple kernel learning framework enables efficient kernel selection in polynomial time relative to the number of selected kernels, even when the total number of basis kernels is exponential.
The method achieves state-of-the-art predictive performance on both synthetic datasets and standard UCI benchmark datasets, often outperforming ℓ²-regularized kernel methods and standard multiple kernel learning.
Theoretical analysis shows that, under appropriate conditions, the framework is consistent in selecting the hull of relevant variables, that is, the relevant kernels together with all of their ancestors in the DAG.
Model consistency is guaranteed when the dual norm of the residual vector is bounded by one, with explicit lower and upper bounds derived for this dual norm using the DAG structure.
The framework naturally supports nonlinear variable selection by organizing basis kernels as a directed grid (a type of DAG), allowing selection of complex, hierarchical feature interactions.
Empirical results confirm that the method is always competitive with ℓ²-regularization and often significantly improves performance, especially in high-dimensional settings with sparse true signal structures.
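
To make the "exponentially many basis kernels" point concrete, here is a minimal sketch under stated assumptions. It takes the simplest case, where each of the p input variables has one base kernel k_i and each subset V of variables defines the basis kernel given by the product of k_i over i in V. The sum of all 2^p basis kernels then equals the product over i of (1 + k_i), which costs only O(p) to evaluate. The Gaussian base kernel, the gamma parameter, and all function names are hypothetical choices for illustration, not details taken from the review.

```python
from itertools import combinations
import math


def base_kernel(a, b, gamma=1.0):
    """One-dimensional Gaussian kernel on a single input variable (assumed choice)."""
    return math.exp(-gamma * (a - b) ** 2)


def subset_kernel(x, y, subset, gamma=1.0):
    """Basis kernel for one subset of variables: product of the base kernels in it."""
    value = 1.0  # the empty subset gives the constant kernel
    for i in subset:
        value *= base_kernel(x[i], y[i], gamma)
    return value


def full_kernel_by_enumeration(x, y, gamma=1.0):
    """Sum of all 2^p subset kernels, computed the expensive way."""
    p = len(x)
    return sum(
        subset_kernel(x, y, subset, gamma)
        for r in range(p + 1)
        for subset in combinations(range(p), r)
    )


def full_kernel_closed_form(x, y, gamma=1.0):
    """Same quantity as a product of (1 + k_i), linear in the number of variables."""
    value = 1.0
    for a, b in zip(x, y):
        value *= 1.0 + base_kernel(a, b, gamma)
    return value


x = [0.1, -0.4, 2.0, 0.7]
y = [0.3, 0.5, 1.5, -1.2]
assert math.isclose(full_kernel_by_enumeration(x, y), full_kernel_closed_form(x, y))
```

The closed form is what keeps the full kernel cheap to evaluate even though it decomposes into exponentially many terms; selecting among those terms is the part that needs the DAG structure.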

This review was created by AI and reviewed by human editors.