QUICK REVIEW

[Paper Review] Least Squares Revisited: Scalable Approaches for Multi-class Prediction

Alekh Agarwal, Sham M. Kakade|arXiv (Cornell University)|Oct 7, 2013

Machine Learning and Algorithms|33 references|21 citations

TL;DR

This paper introduces scalable, parameter-free second-order algorithms, built on iterative least-squares updates, for large-scale multi-class classification. A preconditioner derived from a majorization of the Hessian gives convergence that does not depend on the data's condition number. On MNIST and CIFAR-10, a simple MATLAB implementation of the stagewise variant runs at least 10 times faster than popular optimization packages such as Liblinear and Vowpal Wabbit while attaining state-of-the-art accuracy. The framework also supports joint learning of weights and link functions in GLMs.

ABSTRACT

This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we present a scalable stagewise variant of our approach, which achieves dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.

Motivation & Objective

To develop robust, scalable algorithms for large-scale multi-class classification where both the number of examples $n$ and features $d$ are large.
To overcome the slow convergence of first-order methods on ill-conditioned data, especially in high-dimensional vision tasks like MNIST and CIFAR-10.
To design a parameter-free, metric-free second-order method that avoids line searches and uses only $d \times d$ matrix operations, rather than the $dk \times dk$ operations (for $k$ classes) of traditional Hessian-based methods.
To extend the method to jointly estimate model weights and the link function in GLMs, enabling iterative refinement via prediction-based feature learning.
To develop a stagewise block-coordinate variant that incrementally fits small feature subsets, enabling scalability to high-dimensional problems.

Proposed method

Uses a majorization of the Hessian based on the empirical second moment $\widehat{\Sigma} = \frac{1}{n}\sum_i x_i x_i^T$ as a preconditioner, so that only a $d \times d$ matrix is formed and operations on the full $dk \times dk$ Hessian are avoided.
Employs a simple, parameter-free second-order update rule that is computationally efficient and converges independently of the data’s condition number.
Introduces a stagewise block-coordinate descent procedure that fits least-squares models on small, incremental subsets of features, reducing per-iteration cost.
Extends the framework to jointly learn weights and the link function in GLMs under parametric assumptions, using isotonic regression-inspired techniques.
Applies the method to multi-label settings by modifying the projection step to handle hypercube-valued labels instead of simplex constraints.
Employs a greedy feature selection strategy in the stagewise variant to prioritize informative features and improve convergence speed.
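
The preconditioned update described above can be illustrated with a minimal sketch. It is not the authors' code (the review mentions a MATLAB implementation); it assumes a softmax link, one-hot labels, a unit step size, and a small ridge term added to $\widehat{\Sigma}$ for numerical stability. The function names `softmax`, `log_loss`, and `preconditioned_glm_fit` are hypothetical.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_loss(X, Y, W):
    """Mean multi-class log loss of the softmax model with weights W (d x k)."""
    Z = X @ W
    Z = Z - Z.max(axis=1, keepdims=True)
    log_p = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -np.mean((Y * log_p).sum(axis=1))

def preconditioned_glm_fit(X, Y, n_iters=50, ridge=1e-6):
    """Gradient steps preconditioned by the d x d empirical second moment.

    Assumptions (not taken from the paper's code): softmax link, unit step
    size, and a ridge term to keep the second-moment matrix invertible.
    """
    n, d = X.shape
    k = Y.shape[1]
    Sigma = X.T @ X / n + ridge * np.eye(d)  # d x d, independent of k
    W = np.zeros((d, k))
    losses = [log_loss(X, Y, W)]
    for _ in range(n_iters):
        G = X.T @ (softmax(X @ W) - Y) / n   # gradient of the mean log loss
        W = W - np.linalg.solve(Sigma, G)    # apply Sigma^{-1} without inverting
        losses.append(log_loss(X, Y, W))
    return W, losses
```

Because the softmax Hessian is bounded above by the block-diagonal matrix built from $\widehat{\Sigma}$, each step in this sketch cannot increase the log loss, and no step size or line search has to be tuned. Only one $d \times d$ linear system is solved per iteration.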

Experimental results

Research questions

RQ1: Can second-order least-squares methods be made scalable and parameter-free for large-scale multi-class prediction?
RQ2: How does the performance of second-order methods compare to first-order methods like Vowpal Wabbit and Liblinear on ill-conditioned vision datasets such as MNIST and CIFAR-10?
RQ3: Can a stagewise block-coordinate approach effectively scale second-order methods to high-dimensional problems without incurring prohibitive computational costs?
RQ4: Is it possible to jointly learn the link function and model weights in a GLM framework with theoretical convergence guarantees, even under non-convexity?
RQ5: How effective is the method on well-conditioned, sparse text datasets like NEWS20 and RCV1, where first-order methods typically dominate?

Key findings

On MNIST, the stagewise variant achieved state-of-the-art accuracy with a simple MATLAB implementation, running at least 10 times faster than highly optimized C-based Liblinear and Vowpal Wabbit.
On CIFAR-10, the method achieved over 85% accuracy using linear regression on standard convolutional features, outperforming many deep learning baselines without data augmentation.
With only 400 filters and polynomial features, the method reached over 80% accuracy on CIFAR-10 extremely quickly, demonstrating fast convergence and scalability.
On well-conditioned text datasets like NEWS20 and RCV1, first-order methods (VW, Liblinear) remained competitive, but the stagewise method still achieved comparable test error with significantly reduced training time in some cases.
The method proved robust across diverse data types, with dramatic speedups on ill-conditioned vision data and test error comparable to first-order methods on well-conditioned text data.
The joint learning of weights and link function via isotonic regression-style updates provided a novel, theoretically grounded approach to iterative model refinement in multi-class GLMs.
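
The stagewise variant behind these results can be sketched in simplified form. The sketch below is an illustration under stated assumptions, not the paper's procedure: it uses plain squared loss, fixed contiguous feature blocks visited in order (no greedy selection), and a small ridge term. The name `stagewise_least_squares` and its parameters are hypothetical.

```python
import numpy as np

def stagewise_least_squares(X, Y, block_size=2, n_passes=1, ridge=1e-8):
    """Fit small feature blocks to the current residual, one block at a time.

    Assumptions (for illustration only): squared loss, contiguous blocks in
    a fixed order, and a ridge term so each block system is well posed.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    residual = Y.astype(float).copy()
    history = [float(np.sum(residual ** 2))]  # squared error before any fit
    for _ in range(n_passes):
        for start in range(0, d, block_size):
            cols = slice(start, start + block_size)
            Xb = X[:, cols]
            # Small system: at most block_size x block_size per step.
            A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
            Wb = np.linalg.solve(A, Xb.T @ residual)
            W[cols] += Wb
            residual -= Xb @ Wb               # later blocks fit what is left
            history.append(float(np.sum(residual ** 2)))
    return W, history
```

Each step solves a system no larger than the block size, which is where the reduced per-iteration cost comes from. The squared error recorded in `history` cannot increase from one block fit to the next.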

This review was created by AI and reviewed by human editors.