QUICK REVIEW

[Paper Review] Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks

Ziming Zhang, Matthew Brand | arXiv (Cornell University) | Nov 20, 2017
Stochastic Gradient Optimization Techniques | 32 references | 25 citations
TL;DR

This paper proposes a convergent block coordinate descent (BCD) algorithm for training ReLU-activated deep neural networks with Tikhonov regularization. By lifting ReLU into a higher-dimensional space, it reformulates the non-convex training problem as a multi-convex optimization. The algorithm is proven to converge globally to a stationary point at an R-linear rate, and on MNIST it achieves lower test error than the SGD variants in the Caffe toolbox.

ABSTRACT

By lifting the ReLU function into a higher dimensional space, we develop a smooth multi-convex formulation for training feed-forward deep neural networks (DNNs). This allows us to develop a block coordinate descent (BCD) training algorithm consisting of a sequence of numerically well-behaved convex optimizations. Using ideas from proximal point methods in convex analysis, we prove that this BCD algorithm will converge globally to a stationary point with R-linear convergence rate of order one. In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.

Motivation & Objective

  • To address the non-convexity and vanishing gradient issues in training deep neural networks (DNNs) with ReLU activations.
  • To develop a globally convergent optimization method for DNNs, that is, one guaranteed to reach a stationary point regardless of initialization.
  • To improve generalization performance by formulating training as a multi-convex problem using Tikhonov regularization.
  • To provide theoretical convergence guarantees with R-linear rate for a block coordinate descent algorithm in the DNN setting.
  • To empirically validate that the proposed method outperforms standard SGD-based solvers in test accuracy.

Proposed method

  • Lifts the ReLU activation into a higher-dimensional space to create a smooth, multi-convex formulation of the DNN training problem.
  • Introduces a Tikhonov regularization matrix that encodes network architecture and weights, enabling a structured decomposition of the objective.
  • Decomposes the training objective into three convex sub-problems: Tikhonov-regularized inverse problem, least-squares regression, and classifier learning.
  • Applies block coordinate descent (BCD) by sequentially optimizing over three blocks of variables, one per sub-problem: the hidden-layer outputs (inverse problem), the layer weights (least-squares regression), and the classifier weights.
  • Uses proximal point method ideas to ensure numerical stability and convergence in each sub-optimization step.
  • Employs a line search strategy with diminishing step sizes to guarantee convergence, with theoretical analysis showing R-linear convergence of order one.
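
The lifting and the block updates described above can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it assumes a single hidden layer, a squared loss in place of the paper's classifier loss, plain ridge penalties rather than the paper's Tikhonov matrix, and a few projected-gradient steps for the hidden-output block. The names `lifted_objective`, `bcd_train`, `gamma`, and `lam` are hypothetical. The idea it shows is that ReLU(z) = max(0, z) is the minimizer of (u - z)^2 over u >= 0, so the activation can be replaced by a quadratic penalty plus a nonnegativity constraint, which leaves the objective convex in each block when the other blocks are held fixed.

```python
import numpy as np


def lifted_objective(X, Y, W, U, V, gamma, lam):
    """Lifted objective: U stands in for ReLU(X @ W) and is kept nonnegative."""
    fit = 0.5 * np.sum((Y - U @ V) ** 2)            # output fit (squared loss, an assumption)
    link = 0.5 * gamma * np.sum((U - X @ W) ** 2)   # penalty tying U to the pre-activations
    reg = 0.5 * lam * (np.sum(W ** 2) + np.sum(V ** 2))
    return fit + link + reg


def bcd_train(X, Y, hidden=8, gamma=1.0, lam=1e-2, sweeps=30, inner=5, seed=0):
    """Toy block coordinate descent over (U, W, V) for one hidden layer."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = 0.1 * rng.standard_normal((d, hidden))
    V = 0.1 * rng.standard_normal((hidden, Y.shape[1]))
    U = np.maximum(X @ W, 0.0)  # feasible start: an ordinary ReLU forward pass
    history = [lifted_objective(X, Y, W, U, V, gamma, lam)]

    for _ in range(sweeps):
        # Block 1: hidden outputs U >= 0 (convex QP); a few projected-gradient
        # steps with step size 1/L, where L bounds the curvature in U.
        L = np.linalg.norm(V @ V.T, 2) + gamma
        for _ in range(inner):
            grad = (U @ V - Y) @ V.T + gamma * (U - X @ W)
            U = np.maximum(U - grad / L, 0.0)

        # Block 2: hidden-layer weights W, an exact ridge-regression solve.
        W = np.linalg.solve(gamma * X.T @ X + lam * np.eye(d), gamma * X.T @ U)

        # Block 3: output weights V, an exact ridge-regression solve.
        V = np.linalg.solve(U.T @ U + lam * np.eye(hidden), U.T @ Y)

        history.append(lifted_objective(X, Y, W, U, V, gamma, lam))

    return W, U, V, history
```

Because every block update is either an exact minimization or a descent step on a convex sub-problem, the objective values recorded in `history` should be non-increasing from one sweep to the next.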

Experimental results

Research questions

  • RQ1: Can a Tikhonov-regularized, multi-convex reformulation of ReLU-based DNNs enable global convergence during training?
  • RQ2: Does a block coordinate descent algorithm applied to this reformulated problem converge globally to a stationary point with a provable convergence rate?
  • RQ3: Can this method outperform standard SGD-based training in terms of test accuracy and generalization?
  • RQ4: How does the proposed method mitigate the vanishing gradient problem in deep networks?
  • RQ5: Is the convergence rate of the BCD algorithm R-linear with order one under the proposed formulation?

Key findings

  • The proposed BCD algorithm globally converges to a stationary point with R-linear convergence rate of order one, as proven via proximal point method analysis.
  • The method is numerically stable and does not suffer from the vanishing gradient problem due to the long-range dependency modeling within each sub-problem.
  • On the MNIST dataset, DNNs trained with the BCD algorithm achieved consistently lower test-set error rates than identical architectures trained with all SGD variants in the Caffe toolbox.
  • The Tikhonov regularization matrix effectively encodes network architecture and parameterization, enabling a structured, convex decomposition of the training objective.
  • The algorithm is suitable for training both dense and sparse DNNs, demonstrating versatility in network topology.
  • The convergence analysis holds under the assumption that each sub-problem has a unique solution and that the step size sequence satisfies specific decay conditions (e.g., θ_t = 1/t^p with p > 1).
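
For readers unfamiliar with the terminology, the standard textbook notion of R-linear convergence, and the reason an exponent p > 1 matters for a sequence of the form 1/t^p, can be written as follows. This is general background rather than the paper's own statement: the symbols x_t, x^*, C, and q are assumed here for illustration, and the paper's exact conditions on θ_t are not reproduced.

```latex
% R-linear convergence: the error is bounded by a geometrically decaying sequence.
\[
  \|x_t - x^\ast\| \le C\, q^{t}, \qquad C > 0,\; q \in (0,1),\; t = 1, 2, \dots
\]
% A sequence \theta_t = 1/t^p is summable exactly when p > 1 (p-series test):
\[
  \sum_{t=1}^{\infty} \frac{1}{t^{p}} < \infty \iff p > 1 .
\]
```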

This review was created by AI and reviewed by human editors.