QUICK REVIEW

[Paper Review] Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks

Ziming Zhang, Matthew Brand|arXiv (Cornell University)|Nov 20, 2017

Stochastic Gradient Optimization Techniques | 32 references | 25 citations

TL;DR

This paper proposes a convergent block coordinate descent (BCD) algorithm for training ReLU-activated deep neural networks with Tikhonov regularization. It reformulates the non-convex training problem as a multi-convex optimization problem by lifting ReLU into a higher-dimensional space. The authors prove global convergence to a stationary point at an R-linear rate and report lower test error on MNIST than the SGD variants in the Caffe toolbox.

ABSTRACT

By lifting the ReLU function into a higher dimensional space, we develop a smooth multi-convex formulation for training feed-forward deep neural networks (DNNs). This allows us to develop a block coordinate descent (BCD) training algorithm consisting of a sequence of numerically well-behaved convex optimizations. Using ideas from proximal point methods in convex analysis, we prove that this BCD algorithm will converge globally to a stationary point with R-linear convergence rate of order one. In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.
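
The lifting mentioned in the abstract can be read through a standard identity: ReLU(x) is the minimizer of (u - x)^2 over u >= 0, so the activation can be replaced by a constrained least-squares variable. The snippet below is a minimal numerical check of that identity, not code from the paper; the helper name `lifted_relu` and the brute-force grid search are assumptions made purely for illustration.

```python
import numpy as np

def relu(x):
    """Ordinary ReLU, used as the reference value."""
    return np.maximum(x, 0.0)

def lifted_relu(x, grid):
    """Brute-force argmin over u in `grid` (all u >= 0) of (u - x)^2, per element.

    Hypothetical helper for illustration only: a grid search stands in for
    the constrained convex problem min_{u >= 0} (u - x)^2.
    """
    costs = (grid[None, :] - x[:, None]) ** 2  # shape (len(x), len(grid))
    return grid[np.argmin(costs, axis=1)]

x = np.array([-2.0, -0.5, 0.0, 0.75, 3.0])
grid = np.linspace(0.0, 4.0, 1601)  # candidate values u >= 0, spacing 0.0025
u = lifted_relu(x, grid)

# The constrained minimizer matches ReLU up to the grid resolution.
print(np.allclose(u, relu(x), atol=1e-3))
```

Replacing each activation with such a variable plus a quadratic penalty is what lets the training problem be treated as convex in one block of variables at a time.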

Motivation & Objective

To address the non-convexity and vanishing gradient issues in training deep neural networks (DNNs) with ReLU activations.
To develop an optimization method for DNNs that converges globally, that is, from any initialization, to a stationary point.
To improve generalization performance by formulating training as a multi-convex problem using Tikhonov regularization.
To provide theoretical convergence guarantees with R-linear rate for a block coordinate descent algorithm in the DNN setting.
To empirically validate that the proposed method outperforms standard SGD-based solvers in test accuracy.

Proposed method

Lifts the ReLU activation into a higher-dimensional space to create a smooth, multi-convex formulation of the DNN training problem.
Introduces a Tikhonov regularization matrix that encodes network architecture and weights, enabling a structured decomposition of the objective.
Decomposes the training objective into three convex sub-problems: Tikhonov-regularized inverse problem, least-squares regression, and classifier learning.
Applies block coordinate descent (BCD) by cycling over three blocks of variables: the hidden-layer outputs (the lifted ReLU variables), the network weights, and the classifier parameters.
Uses proximal point method ideas to ensure numerical stability and convergence in each sub-optimization step.
Employs a line search strategy with diminishing step sizes to guarantee convergence, with theoretical analysis showing R-linear convergence of order one.
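
The steps above can be sketched on a one-hidden-layer network. This is a simplified illustration under stated assumptions, not the authors' algorithm: a squared loss replaces the classifier loss, a small ridge term `lam` is added so the linear solves are well-posed, the proximal terms and step-size schedule are omitted, and the U-block is solved approximately by projected gradient steps to keep the code NumPy-only. All function and variable names are hypothetical.

```python
import numpy as np

def objective(X, Y, U, V, W, gamma, lam):
    """Lifted objective: output fit + penalty tying U to X @ W + ridge."""
    fit = 0.5 * np.sum((Y - U @ V) ** 2)
    link = 0.5 * gamma * np.sum((U - X @ W) ** 2)
    ridge = 0.5 * lam * (np.sum(V ** 2) + np.sum(W ** 2))
    return fit + link + ridge

def update_U(X, Y, U, V, W, gamma, n_steps=20):
    """Block 1: hidden outputs U >= 0 (convex in U for fixed V, W).

    Projected gradient with step 1/L, where L bounds the curvature of the
    U-block, so each step cannot increase the objective.
    """
    L = np.linalg.norm(V @ V.T, 2) + gamma
    A = X @ W
    for _ in range(n_steps):
        grad = (U @ V - Y) @ V.T + gamma * (U - A)
        U = np.maximum(U - grad / L, 0.0)  # projection onto U >= 0
    return U

def update_V(Y, U, lam):
    """Block 2: output layer, a ridge least-squares solve in closed form."""
    h = U.shape[1]
    return np.linalg.solve(U.T @ U + lam * np.eye(h), U.T @ Y)

def update_W(X, U, gamma, lam):
    """Block 3: hidden-layer weights, also a ridge least-squares solve."""
    d = X.shape[1]
    return np.linalg.solve(gamma * X.T @ X + lam * np.eye(d), gamma * X.T @ U)

def bcd_train(X, Y, hidden=8, gamma=1.0, lam=1e-2, n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], hidden))
    V = 0.1 * rng.standard_normal((hidden, Y.shape[1]))
    U = np.maximum(X @ W, 0.0)  # start from the ordinary forward pass
    history = [objective(X, Y, U, V, W, gamma, lam)]
    for _ in range(n_iters):
        U = update_U(X, Y, U, V, W, gamma)
        V = update_V(Y, U, lam)
        W = update_W(X, U, gamma, lam)
        history.append(objective(X, Y, U, V, W, gamma, lam))
    return U, V, W, history

# Synthetic regression data generated by a small ReLU network.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
Y = np.maximum(X @ rng.standard_normal((5, 8)), 0.0) @ rng.standard_normal((8, 2))

U, V, W, history = bcd_train(X, Y)
print(history[0], history[-1])
```

Because every block update either minimizes its sub-problem exactly or takes a descent step, the recorded objective values should be non-increasing. After training, predictions would come from the ordinary forward pass `np.maximum(X @ W, 0.0) @ V`.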

Experimental results

Research questions

RQ1: Can a Tikhonov-regularized, multi-convex reformulation of ReLU-based DNNs enable global convergence during training?
RQ2: Does a block coordinate descent algorithm applied to this reformulated problem converge globally to a stationary point with a provable convergence rate?
RQ3: Can this method outperform standard SGD-based training in terms of test accuracy and generalization?
RQ4: How does the proposed method mitigate the vanishing gradient problem in deep networks?
RQ5: Is the convergence rate of the BCD algorithm R-linear with order one under the proposed formulation?

Key findings

The proposed BCD algorithm globally converges to a stationary point with R-linear convergence rate of order one, as proven via proximal point method analysis.
The method is numerically stable and does not suffer from the vanishing gradient problem, because each sub-problem models long-range dependencies across layers directly.
On the MNIST dataset, DNNs trained with the BCD algorithm achieved consistently lower test-set error rates than identical architectures trained with all SGD variants in the Caffe toolbox.
The Tikhonov regularization matrix effectively encodes network architecture and parameterization, enabling a structured, convex decomposition of the training objective.
The algorithm is suitable for training both dense and sparse DNNs, demonstrating versatility in network topology.
The convergence analysis holds under the assumption that each sub-problem has a unique solution, and the step size sequence satisfies specific decay conditions (e.g., θt = 1/t^p with p > 1).
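
For readers unfamiliar with the term, the standard textbook definition of R-linear convergence is given below. The symbols x_t, x^*, epsilon_t, and rho are generic notation assumed here for illustration; they are not the paper's constants.

```latex
% A sequence x_t converges R-linearly to x^* if its error is dominated
% by a sequence \varepsilon_t that shrinks by a fixed factor each step:
\|x_t - x^\ast\| \le \varepsilon_t,
\qquad
\varepsilon_{t+1} \le \rho\,\varepsilon_t,
\qquad
0 < \rho < 1 .
% Unrolling the recursion gives a geometric bound on the error:
\|x_t - x^\ast\| \le \varepsilon_0\,\rho^{\,t}.
```

Unlike Q-linear convergence, this does not require the error itself to shrink at every iteration, only that it stays under a geometrically decaying envelope.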


This review was created by AI and reviewed by human editors.