[Paper Review] Convergent Block Coordinate Descent for Training Tikhonov Regularized Deep Neural Networks
This paper proposes a convergent block coordinate descent (BCD) algorithm for training ReLU-activated deep neural networks with Tikhonov regularization. It reformulates the non-convex training problem as a multi-convex optimization by lifting ReLU into a higher-dimensional space. The method is proven to converge globally to a stationary point at an R-linear rate, and on MNIST it achieves lower test error rates than SGD, suggesting improved generalization.
By lifting the ReLU function into a higher dimensional space, we develop a smooth multi-convex formulation for training feed-forward deep neural networks (DNNs). This allows us to develop a block coordinate descent (BCD) training algorithm consisting of a sequence of numerically well-behaved convex optimizations. Using ideas from proximal point methods in convex analysis, we prove that this BCD algorithm will converge globally to a stationary point with R-linear convergence rate of order one. In experiments with the MNIST database, DNNs trained with this BCD algorithm consistently yielded better test-set error rates than identical DNN architectures trained via all the stochastic gradient descent (SGD) variants in the Caffe toolbox.
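One common way to read the "lifting" step is through the standard identity that ReLU is the Euclidean projection onto the non-negative half-line. The block below states that identity and a schematic layer-wise penalty it suggests. The symbols u_n, W_n and \gamma_n are placeholders chosen here for illustration, and the paper's exact objective may differ.

```latex
% ReLU as a projection onto the non-negative reals (standard identity)
\max(0, x) \;=\; \operatorname*{arg\,min}_{u \ge 0} \; (u - x)^2

% Schematic consequence (illustrative notation): replace the hard constraint
% u_n = \max(0, W_n u_{n-1}) by a quadratic penalty with u_n \ge 0,
% which is convex in u_n for fixed W_n and convex in W_n for fixed u_n, u_{n-1}.
\min_{u_n \ge 0,\; W_n} \; \frac{\gamma_n}{2} \, \lVert u_n - W_n u_{n-1} \rVert_2^2
```

Under this reading, the hidden outputs become explicit optimization variables, which is what makes a block-wise (multi-convex) treatment possible.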
Motivation & Objective
- To address the non-convexity and vanishing gradient issues in training deep neural networks (DNNs) with ReLU activations.
- To develop a globally convergent optimization method for DNNs, meaning one that converges to a stationary point regardless of initialization.
- To improve generalization performance by formulating training as a multi-convex problem using Tikhonov regularization.
- To provide theoretical convergence guarantees with R-linear rate for a block coordinate descent algorithm in the DNN setting.
- To empirically validate that the proposed method outperforms standard SGD-based solvers in test accuracy.
Proposed method
- Lifts the ReLU activation into a higher-dimensional space to create a smooth, multi-convex formulation of the DNN training problem.
- Introduces a Tikhonov regularization matrix that encodes network architecture and weights, enabling a structured decomposition of the objective.
- Decomposes the training objective into three convex sub-problems: Tikhonov-regularized inverse problem, least-squares regression, and classifier learning.
- Applies block coordinate descent (BCD) by sequentially optimizing over three blocks of variables: the hidden-unit outputs (the lifted activation variables), the output-layer classifier weights, and the remaining network weights.
- Uses proximal point method ideas to ensure numerical stability and convergence in each sub-optimization step.
- Employs a line search strategy with diminishing step sizes to guarantee convergence, with theoretical analysis showing R-linear convergence of order one.
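The block-wise scheme in the list above can be sketched on a deliberately tiny problem. The following is a toy illustration only, not the paper's algorithm: it assumes a scalar network with one hidden unit, a hypothetical objective F(u, w, v) = sum_i (v*u_i - y_i)^2 + gamma * sum_i (u_i - w*x_i)^2 with lifted variables u_i >= 0, and a proximal term rho * (new - old)^2 in every block update. The names `objective`, `bcd_step`, `run_bcd`, `gamma` and `rho` are all introduced here for illustration.

```python
# Toy sketch (hypothetical model, not the paper's formulation):
# scalar one-hidden-unit network with lifted ReLU variables u_i >= 0.
#   F(u, w, v) = sum_i (v*u_i - y_i)^2 + gamma * sum_i (u_i - w*x_i)^2
# Each block update minimizes F plus a proximal term rho * (new - old)^2.


def objective(u, w, v, xs, ys, gamma):
    fit = sum((v * ui - yi) ** 2 for ui, yi in zip(u, ys))
    link = sum((ui - w * xi) ** 2 for ui, xi in zip(u, xs))
    return fit + gamma * link


def bcd_step(u, w, v, xs, ys, gamma, rho):
    # Block 1: hidden outputs. Each u_i solves a 1-D convex quadratic; the
    # constraint u_i >= 0 is enforced by clipping the unconstrained minimizer.
    u = [max(0.0, (v * yi + gamma * w * xi + rho * ui) / (v * v + gamma + rho))
         for ui, xi, yi in zip(u, xs, ys)]
    # Block 2: hidden weight w (proximal least squares, closed form).
    w = (gamma * sum(ui * xi for ui, xi in zip(u, xs)) + rho * w) / (
        gamma * sum(xi * xi for xi in xs) + rho)
    # Block 3: output weight v (proximal least squares, closed form).
    v = (sum(ui * yi for ui, yi in zip(u, ys)) + rho * v) / (
        sum(ui * ui for ui in u) + rho)
    return u, w, v


def run_bcd(xs, ys, gamma=1.0, rho=0.1, iters=50):
    w, v = 0.5, 0.5                      # arbitrary starting point
    u = [max(0.0, w * xi) for xi in xs]  # feasible initial hidden outputs
    history = [objective(u, w, v, xs, ys, gamma)]
    for _ in range(iters):
        u, w, v = bcd_step(u, w, v, xs, ys, gamma, rho)
        history.append(objective(u, w, v, xs, ys, gamma))
    return u, w, v, history


if __name__ == "__main__":
    u, w, v, history = run_bcd([-1.0, 0.5, 1.0, 2.0], [0.0, 1.0, 2.0, 4.0])
    print("first and last objective values:", history[0], history[-1])
```

Because every block update exactly minimizes the objective plus a non-negative proximal term, the objective value cannot increase from one update to the next. This gives only the monotone-descent flavor of the method; the paper's convergence-rate result is a separate, stronger statement.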
Experimental results
Research questions
- RQ1: Can a Tikhonov-regularized, multi-convex reformulation of ReLU-based DNNs enable global convergence during training?
- RQ2: Does a block coordinate descent algorithm applied to this reformulated problem converge globally to a stationary point with a provable convergence rate?
- RQ3: Can this method outperform standard SGD-based training in terms of test accuracy and generalization?
- RQ4: How does the proposed method mitigate the vanishing gradient problem in deep networks?
- RQ5: Is the convergence rate of the BCD algorithm R-linear with order one under the proposed formulation?
Key findings
- The proposed BCD algorithm globally converges to a stationary point with R-linear convergence rate of order one, as proven via proximal point method analysis.
- The method is numerically stable and does not suffer from the vanishing gradient problem due to the long-range dependency modeling within each sub-problem.
- On the MNIST dataset, DNNs trained with the BCD algorithm achieved consistently lower test-set error rates than identical architectures trained with all SGD variants in the Caffe toolbox.
- The Tikhonov regularization matrix effectively encodes network architecture and parameterization, enabling a structured, convex decomposition of the training objective.
- The algorithm is suitable for training both dense and sparse DNNs, demonstrating versatility in network topology.
- The convergence analysis holds under the assumption that each sub-problem has a unique solution, and the step size sequence satisfies specific decay conditions (e.g., θt = 1/t^p with p > 1).
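For readers unfamiliar with the term used in the findings above: a sequence converges R-linearly when its error is bounded by a geometrically decaying sequence, error_k <= C * q^k with 0 < q < 1, even if the error itself does not shrink at every step. The short sketch below illustrates that distinction on a made-up error sequence. The sequence and the helper names `errors`, `is_r_linear` and `max_q_ratio` are assumptions for illustration and are unrelated to the paper's experiments.

```python
def errors(n):
    # Made-up error sequence: 2^-k on odd k, 8^-k on even k (k = 1..n).
    return [2.0 ** -k if k % 2 else 8.0 ** -k for k in range(1, n + 1)]


def is_r_linear(errs, c, q):
    # R-linear bound: every error lies under the geometric envelope c * q^k.
    return all(e <= c * q ** k for k, e in enumerate(errs, start=1))


def max_q_ratio(errs):
    # Largest successive ratio e_{k+1} / e_k; a value above 1 means the
    # error grew at some step, so the sequence is not Q-linearly convergent.
    return max(b / a for a, b in zip(errs, errs[1:]))


if __name__ == "__main__":
    errs = errors(20)
    print("bounded by 0.5^k:", is_r_linear(errs, c=1.0, q=0.5))
    print("largest successive ratio:", max_q_ratio(errs))
```

The point of the example is that an R-linear guarantee bounds the overall decay of the error without promising that every iteration improves on the previous one.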
This review was created by AI and reviewed by human editors.