QUICK REVIEW

[Paper Review] Beyond Backprop: Online Alternating Minimization with Auxiliary Variables

Anna Choromanska, Benjamin Cowen|arXiv (Cornell University)|Jun 24, 2018

Stochastic Gradient Optimization Techniques|33 citations

TL;DR

This paper proposes a novel online (stochastic/mini-batch) alternating minimization (AM) method for training deep neural networks using auxiliary variables, avoiding backpropagation's gradient chain rule. It provides the first theoretical convergence guarantees for AM in stochastic settings and achieves competitive accuracy on MNIST, CIFAR-10, and HIGGS datasets, with runtime comparable to SGD and Adam.

ABSTRACT

Despite significant recent advances in deep neural networks, training them remains a challenge due to the highly non-convex nature of the objective function. State-of-the-art methods rely on error backpropagation, which suffers from several well-known issues, such as vanishing and exploding gradients, inability to handle non-differentiable nonlinearities and to parallelize weight-updates across layers, and biological implausibility. These limitations continue to motivate exploration of alternative training algorithms, including several recently proposed auxiliary-variable methods which break the complex nested objective function into local subproblems. However, those techniques are mainly offline (batch), which limits their applicability to extremely large datasets, as well as to online, continual or reinforcement learning. The main contribution of our work is a novel online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks, together with the first theoretical convergence guarantees for AM in stochastic settings and promising empirical results on a variety of architectures and datasets.

Motivation & Objective

Address the limitations of backpropagation, such as vanishing and exploding gradients, the inability to handle non-differentiable nonlinearities or to parallelize weight updates across layers, and biological implausibility.
Overcome the constraints of existing auxiliary-variable methods, which are mostly offline (batch) and unsuitable for online or continual learning.
Develop a memory-efficient, online stochastic alternating minimization framework that enables layer-wise, local weight updates without backpropagation.
Provide the first theoretical convergence guarantees for alternating minimization in a stochastic (mini-batch) setting.
Demonstrate empirical effectiveness across diverse architectures and datasets, including fully connected networks and LeNet-5 on MNIST and CIFAR-10.

Proposed method

Introduce auxiliary variables per layer to decouple the deep network's nested objective into local subproblems, enabling alternating minimization over weights and activations.
Propose two variants: AM-Adam, which uses adaptive gradient updates for weights, and AM-mem, which leverages a surrogate objective inspired by online dictionary learning (Mairal et al., 2009).
Perform alternating optimization: first update auxiliary variables (activations) for fixed weights, then update all weights in parallel across layers using local information.
Use mini-batch stochastic updates to enable online learning, avoiding full-batch computation and enabling scalability to large datasets.
Avoid Lagrange multipliers, reducing memory usage to levels comparable to standard SGD, while maintaining the benefits of local, biologically plausible updates.
Formulate the optimization problem such that weight updates depend only on local signals and current layer activations, enhancing computational and biological plausibility.
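The alternating scheme described above can be sketched for a two-layer network as follows. This is a minimal illustration and not the authors' implementation: it assumes a squared loss, a ReLU hidden layer, and a quadratic penalty with weight mu tying the auxiliary codes C to the layer output X W1^T. Plain gradient steps stand in for both the Adam updates of AM-Adam and the surrogate objective of AM-mem. All names (am_step, train, mu, code_lr, code_iters, weight_lr) and the synthetic data are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def am_step(W1, W2, X, Y, mu=1.0, code_lr=0.1, code_iters=5, weight_lr=0.05):
    """One online alternating-minimization step on a mini-batch.

    X: (n, d) inputs, Y: (n, k) targets, W1: (h, d), W2: (k, h).
    Assumed per-sample objective (illustrative):
        0.5 * ||y - W2 relu(c)||^2 + 0.5 * mu * ||c - W1 x||^2
    """
    n = X.shape[0]
    pre = X @ W1.T          # what layer 1 currently produces
    C = pre.copy()          # auxiliary codes, initialised by a forward pass

    # Phase 1: update the codes while the weights stay fixed.
    for _ in range(code_iters):
        err = relu(C) @ W2.T - Y                          # (n, k)
        grad_C = (err @ W2) * (C > 0) + mu * (C - pre)    # (n, h)
        C -= code_lr * grad_C

    # Phase 2: update the weights while the codes stay fixed.
    # Each layer only sees its own input and its own target, so the two
    # updates are independent and could run in parallel.
    grad_W1 = -mu * (C - pre).T @ X / n                   # layer 1: fit codes C from X
    A = relu(C)
    grad_W2 = (A @ W2.T - Y).T @ A / n                    # layer 2: fit Y from relu(C)
    return W1 - weight_lr * grad_W1, W2 - weight_lr * grad_W2


def mse(W1, W2, X, Y):
    """Mean squared error of the ordinary forward pass (no codes involved)."""
    return float(np.mean((relu(X @ W1.T) @ W2.T - Y) ** 2))


def train(steps=300, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    d, h = 5, 16
    # Synthetic regression data, used only to exercise the update rule.
    X = rng.normal(size=(512, d))
    Y = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(h, d))
    W2 = rng.normal(scale=1.0 / np.sqrt(h), size=(1, h))

    before = mse(W1, W2, X, Y)
    for _ in range(steps):
        idx = rng.integers(0, X.shape[0], size=batch)     # draw a mini-batch
        W1, W2 = am_step(W1, W2, X[idx], Y[idx])
    after = mse(W1, W2, X, Y)
    return before, after, W1, W2


if __name__ == "__main__":
    before, after, _, _ = train()
    print(f"MSE before: {before:.4f}  after: {after:.4f}")
```

In this sketch no gradient is propagated through more than one layer: the first layer's update uses only its input and its codes, and the second layer's update uses only the codes and the targets. Because the codes are re-initialised from a forward pass on every mini-batch, nothing has to be stored between batches.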

Experimental results

Research questions

RQ1: Can alternating minimization with auxiliary variables be adapted to an online, stochastic (mini-batch) setting to enable continual and scalable deep learning?
RQ2: Does the proposed online AM method achieve convergence in a stochastic setting, and can theoretical guarantees be established?
RQ3: How does the performance of online AM compare to standard backprop-based methods like Adam and SGD across different architectures and datasets?
RQ4: Can the method handle non-differentiable nonlinearities and avoid the vanishing gradient problem without relying on backpropagation?
RQ5: What is the computational efficiency and memory footprint of the proposed method relative to existing backprop and auxiliary-variable baselines?

Key findings

The proposed online AM method achieves test accuracy of 97.8% on MNIST with a fully connected network, comparable to Adam and SGD, despite avoiding backpropagation.
On CIFAR-10, the AM-Adam variant achieved 87.2% accuracy with 500 units per layer, outperforming SGD and matching Adam’s performance under optimal hyperparameters.
On the HIGGS dataset, AM-Adam matched Adam’s accuracy of 70.1% with the same learning rate and architecture, demonstrating robustness on high-dimensional, real-world data.
Runtime measurements show that AM-Adam is nearly on par with Adam and SGD (e.g., 443 seconds for 450 mini-batches on LeNet-5/MNIST), indicating computational feasibility.
The method achieves convergence in the stochastic setting, with formal theoretical guarantees provided, marking the first such result for alternating minimization in online deep learning.
AM-mem and AM-Adam variants show consistent performance across multiple weight initializations and datasets, with minimal hyperparameter sensitivity compared to baseline methods.

This review was created by AI and reviewed by human editors.