QUICK REVIEW

[Paper Review] Breaking the Curse of Dimensionality with Convex Neural Networks

Francis Bach|arXiv (Cornell University)|Dec 30, 2014

Stochastic Gradient Optimization Techniques|61 references|321 citations

TL;DR

This paper studies a convex formulation of single-hidden-layer neural networks with non-decreasing, positively homogeneous activation functions (e.g., ReLU), obtained by letting the number of hidden units grow unbounded and applying non-Euclidean regularization to the output weights. The resulting generalization bounds adapt to unknown low-dimensional structure and, with sparsity-inducing norms on the input weights, allow non-linear variable selection even when the number of variables is potentially exponential in the number of observations. The non-convex step of adding a new unit remains the computational bottleneck: the paper studies convex (semidefinite) relaxations of it but leaves open whether a polynomial-time algorithm with these guarantees exists.

ABSTRACT

We consider neural networks with a single hidden layer and non-decreasing homogeneous activation functions like the rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, we provide a detailed theoretical analysis of their generalization performance, with a study of both the approximation and the estimation errors. We show in particular that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing norms on the input weights, we show that high-dimensional non-linear variable selection may be achieved, without any strong assumption regarding the data and with a total number of variables potentially exponential in the number of observations. In addition, we provide a simple geometric interpretation to the non-convex problem of addition of a new unit, which is the core potentially hard computational element in the framework of learning from continuously many basis functions. We provide simple conditions for convex relaxations to achieve the same generalization error bounds, even when constant-factor approximations cannot be found (e.g., because it is NP-hard such as for the zero-homogeneous activation function). We were not able to find strong enough convex relaxations and leave open the existence or non-existence of polynomial-time algorithms.

Motivation & Objective

To address the curse of dimensionality in non-parametric learning by developing a convex optimization framework for single-hidden-layer neural networks.
To enable adaptive learning of underlying low-dimensional structures, such as dependence on a subspace or non-linear variable selection, without strong assumptions on data.
To provide theoretical guarantees on generalization error by analyzing both approximation and estimation errors in the convex formulation.
To explore convex relaxations of the non-convex subproblem of adding new hidden units, with conditions under which they preserve generalization error bounds.
To identify geometric interpretations and sufficient conditions for convex relaxations to achieve optimal performance, even without constant-factor approximations.

Proposed method

Formulates single-hidden-layer neural networks with non-decreasing, positively homogeneous activation functions (e.g., ReLU) as a convex optimization problem by letting the number of hidden units grow unbounded and applying non-Euclidean regularization on output weights.
Uses a geometric interpretation of the activation function to derive convex relaxations of the non-convex subproblem of adding a new unit, based on zonotopes and Hausdorff distance.
Proposes a d-dimensional relaxation by introducing a rank-1 matrix $ V = vv^T $ with $ \|v\|_2 = 1 $, leading to a convex semidefinite program with constraints involving $ \|Vz_i\|_2 \leq 2u_i - v^Tz_i \leq \sqrt{z_i^T V z_i} $.
Introduces an (n+d)-dimensional relaxation using matrices $ U = uu^T $, $ V = vv^T $, and $ J = uv^T $, with constraints involving $ |\text{tr}(V z_i z_j^T)| \leq 4U_{ij} + z_j^T V z_i - 2\delta_i^T J z_j - 2\delta_j^T J z_i $.
Considers a sign vector relaxation with $ S = ss^T $, $ J = s v^T $, and constraints including $ \delta_i^T J x_i \geq \max_{j \neq i} |\delta_j^T J x_i| $ and $ (x_i^T V x_i)^{1/2} \leq \delta_i^T J x_i $.
Maximizes the objective $ \frac{1}{2n} \sum_{i=1}^n y_i (\delta_i^T J x_i + v^T x_i) $ under semidefinite constraints to yield a convex relaxation.
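The incremental scheme described above (add one unit, then re-solve a convex problem over the output weights) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `best_new_unit`, `fit_output_weights`, and `greedy_fit` are hypothetical, the non-convex "new unit" step is approximated by random search over unit vectors rather than by any of the semidefinite relaxations, and an ℓ1 penalty stands in for the non-Euclidean regularization on a finite set of output weights.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def best_new_unit(X, g, n_candidates=2000, rng=None):
    """Heuristic for the non-convex 'add a new unit' step (assumption:
    random search, NOT the SDP relaxation). Returns the unit vector v
    with the largest |(1/n) * sum_i g_i * relu(v^T x_i)| among candidates."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    V = rng.standard_normal((n_candidates, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # project onto the unit sphere
    scores = np.abs(relu(X @ V.T).T @ g) / n       # one score per candidate
    k = int(np.argmax(scores))
    return V[k], float(scores[k])

def fit_output_weights(H, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for the convex problem
    min_eta (1/2n) * ||y - H eta||^2 + lam * ||eta||_1."""
    n, m = H.shape
    eta = np.zeros(m)
    lipschitz = np.linalg.norm(H, 2) ** 2 / n      # smoothness of the quadratic part
    if lipschitz == 0.0:
        return eta
    for _ in range(n_iter):
        grad = H.T @ (H @ eta - y) / n
        z = eta - grad / lipschitz
        eta = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)  # soft-threshold
    return eta

def greedy_fit(X, y, n_units=10, lam=1e-3, seed=0):
    """Alternate between adding a unit and refitting the output weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = np.empty((0, d))   # input weights, one row per hidden unit
    eta = np.zeros(0)      # output weights
    for _ in range(n_units):
        residual = y - relu(X @ V.T) @ eta
        v, _ = best_new_unit(X, residual, rng=rng)
        V = np.vstack([V, v])
        eta = fit_output_weights(relu(X @ V.T), y, lam)
    return V, eta
```

Only the output-weight step is convex here; the quality of the overall fit depends on how well the unit-selection step is solved, which is exactly the part the relaxations listed above try to address.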

Experimental results

Research questions

RQ1: Can convex neural networks with unbounded hidden units and non-Euclidean regularization achieve generalization error bounds independent of input dimension?
RQ2: Under what conditions do convex relaxations of the non-convex subproblem of adding a new unit preserve the same generalization error bounds?
RQ3: Can such convex formulations adapt to low-dimensional structures, such as dependence on a k-dimensional subspace, without prior knowledge of k?
RQ4: Is non-linear variable selection possible in high-dimensional settings (even with exponentially many variables) using sparsity-inducing norms on input weights?
RQ5: Do the proposed convex relaxations lead to polynomial-time algorithms with non-exponential sample complexity?

Key findings

The convex formulation achieves generalization error bounds that are adaptive to unknown low-dimensional structures, such as dependence on a k-dimensional subspace, without requiring prior knowledge of k.
When sparsity-inducing norms are applied to input weights, the method enables high-dimensional non-linear variable selection, even when the number of variables is exponential in the number of observations.
The method provides theoretical guarantees on both approximation and estimation errors, with the estimation error scaling as $ O(1/\sqrt{n}) $, though this rate is too slow to preserve bounds under polynomial-time algorithms.
Convex relaxations of the non-convex subproblem can achieve the same generalization error bounds if certain geometric conditions are met, even when constant-factor approximations are not available.
The geometric interpretation of the problem as computing the Hausdorff distance between zonotopes or solving a binary linear classification problem provides insight into the structure of the solution space.
Despite the theoretical promise, no provably polynomial-time algorithm with non-exponential sample complexity is currently known, leaving the existence or non-existence of such algorithms open.
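The zonotope interpretation mentioned in the findings rests on a standard fact: for the zonotope generated by segments $[0, z_i]$, the support function in direction $v$ is $\sum_i (v^T z_i)_+$, a sum of ReLU activations. A ReLU objective of the form $\sum_i g_i (v^T x_i)_+$ with signed weights $g_i$ is then a difference of two support functions, and maximizing its absolute value over unit vectors $v$ is a Hausdorff-distance computation between two zonotopes. The sketch below checks the support-function identity by brute force on a small example; the function names and the exact scaling of the generators are illustrative assumptions and may differ from the paper's normalization.

```python
import itertools
import numpy as np

def zonotope_support_bruteforce(Z, v):
    """Maximize v^T p over all points p = sum_{i in S} z_i, for every subset S
    of the generators (rows of Z). A linear function on a zonotope attains its
    maximum at one of these points. Exponential in len(Z): small inputs only."""
    n = Z.shape[0]
    best = -np.inf
    for mask in itertools.product([0.0, 1.0], repeat=n):
        best = max(best, float(v @ (np.array(mask) @ Z)))
    return best

def zonotope_support_relu(Z, v):
    """Closed form of the same support function: sum_i relu(v^T z_i)."""
    return float(np.maximum(Z @ v, 0.0).sum())

def new_unit_objective(X, g, v):
    """sum_i g_i * relu(v^T x_i), written as a difference of the support
    functions of two zonotopes (positive-weight and negative-weight points)."""
    pos, neg = g > 0, g < 0
    Z_plus = g[pos, None] * X[pos]     # generators g_i * x_i for g_i > 0
    Z_minus = -g[neg, None] * X[neg]   # generators |g_i| * x_i for g_i < 0
    return zonotope_support_relu(Z_plus, v) - zonotope_support_relu(Z_minus, v)
```

For example, calling both support functions on the same random `Z` and `v` should agree up to floating-point error, and `new_unit_objective(X, g, v)` should match `np.sum(g * np.maximum(X @ v, 0.0))`.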


This review was created by AI and reviewed by human editors.