QUICK REVIEW

[Paper Review] Least Squares Revisited: Scalable Approaches for Multi-class Prediction

Alekh Agarwal, Sham M. Kakade | arXiv (Cornell University) | Oct 7, 2013
Machine Learning and Algorithms | 33 references | 21 citations
TL;DR

This paper introduces scalable, parameter-free second-order algorithms for large-scale multi-class classification, built on iterative least-squares updates. A preconditioner derived from the empirical second-moment matrix majorizes the Hessian, so convergence does not depend on the data's condition number. A stagewise variant, written in simple MATLAB code, trains dramatically faster than optimization packages such as Liblinear and Vowpal Wabbit on MNIST and CIFAR-10 (at least 10 times faster on MNIST) while reaching state-of-the-art accuracy. The framework also supports joint learning of weights and link functions in GLMs.

ABSTRACT

This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we present a scalable stagewise variant of our approach, which achieves dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.

Motivation & Objective

  • To develop robust, scalable algorithms for large-scale multi-class classification where both the number of examples $n$ and features $d$ are large.
  • To overcome the slow convergence of first-order methods on ill-conditioned data, especially in high-dimensional vision tasks like MNIST and CIFAR-10.
  • To design a parameter-free, metric-free second-order method that avoids line searches and uses only $d \times d$ matrix operations, unlike traditional Hessian-based methods, which operate on $dk \times dk$ matrices.
  • To extend the method to jointly estimate model weights and the link function in GLMs, enabling iterative refinement via prediction-based feature learning.
  • To develop a stagewise block-coordinate variant that incrementally fits small feature subsets, enabling scalability to high-dimensional problems.

Proposed method

  • Uses a majorization of the Hessian based on the empirical second moment $\widehat{\Sigma} = \frac{1}{n}\sum_i x_i x_i^T$ as a preconditioner, avoiding $\mathcal{O}(dk \times dk)$ matrix operations.
  • Employs a simple, parameter-free second-order update rule that is computationally efficient and converges independently of the data’s condition number.
  • Introduces a stagewise block-coordinate descent procedure that fits least-squares models on small, incremental subsets of features, reducing per-iteration cost.
  • Extends the framework to jointly learn weights and the link function in GLMs under parametric assumptions, using isotonic regression-inspired techniques.
  • Applies the method to multi-label settings by modifying the projection step to handle hypercube-valued labels instead of simplex constraints.
  • Employs a greedy feature selection strategy in the stagewise variant to prioritize informative features and improve convergence speed.
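
The preconditioned update described in the bullets above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the softmax link, the unit step size, the small ridge term added to $\widehat{\Sigma}$ for numerical stability, and the names `softmax`, `logistic_loss`, and `preconditioned_fit` are all assumptions made for illustration. What it shows is that each iteration solves only a $d \times d$ system, with one preconditioner shared across all $k$ classes.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def logistic_loss(X, Y, W):
    """Average multinomial logistic loss for one-hot labels Y."""
    P = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))


def preconditioned_fit(X, Y, n_iters=50, ridge=1e-6):
    """Illustrative preconditioned gradient iteration (assumed form).

    X: (n, d) features, Y: (n, k) one-hot labels. Returns W of shape (d, k).
    """
    n, d = X.shape
    # Empirical second moment; the ridge term is an assumption, added
    # only to keep the linear solve well-posed.
    sigma_hat = X.T @ X / n + ridge * np.eye(d)
    W = np.zeros((d, Y.shape[1]))
    for _ in range(n_iters):
        # Gradient of the average logistic loss with respect to W.
        grad = X.T @ (softmax(X @ W) - Y) / n
        # Preconditioned step: a d x d solve, never a dk x dk one.
        W -= np.linalg.solve(sigma_hat, grad)
    return W
```

Because $\widehat{\Sigma}$ does not change between iterations, a practical version would likely factor it once and reuse the factorization for every step.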

Experimental results

Research questions

  • RQ1: Can second-order least-squares methods be made scalable and parameter-free for large-scale multi-class prediction?
  • RQ2: How does the performance of second-order methods compare to first-order methods like Vowpal Wabbit and Liblinear on ill-conditioned vision datasets such as MNIST and CIFAR-10?
  • RQ3: Can a stagewise block-coordinate approach effectively scale second-order methods to high-dimensional problems without incurring prohibitive computational costs?
  • RQ4: Is it possible to jointly learn the link function and model weights in a GLM framework with theoretical convergence guarantees, even under non-convexity?
  • RQ5: How effective is the method on well-conditioned, sparse text datasets like NEWS20 and RCV1, where first-order methods typically dominate?

Key findings

  • On MNIST, the stagewise variant achieved state-of-the-art accuracy with a simple MATLAB implementation, running at least 10 times faster than highly optimized C-based Liblinear and Vowpal Wabbit.
  • On CIFAR-10, the method achieved over 85% accuracy using linear regression on standard convolutional features, outperforming many deep learning baselines without data augmentation.
  • With only 400 filters and polynomial features, the method reached over 80% accuracy on CIFAR-10 extremely quickly, demonstrating fast convergence and scalability.
  • On well-conditioned text datasets like NEWS20 and RCV1, first-order methods (VW, Liblinear) remained competitive, but the stagewise method still achieved comparable test error with significantly reduced training time in some cases.
  • The method was robust across data types, with dramatic speedups on ill-conditioned vision data and competitive performance on well-conditioned text data.
  • The joint learning of weights and link function via isotonic regression-style updates provided a novel, theoretically grounded approach to iterative model refinement in multi-class GLMs.
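
As a rough illustration of the stagewise idea behind these results, the sketch below fits successive small blocks of features to the current residual by least squares. It assumes a squared-loss objective, takes feature blocks in column order rather than selecting them greedily, and adds a small ridge term for stability. The function name `stagewise_least_squares` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np


def stagewise_least_squares(X, Y, block_size=2, ridge=1e-6):
    """Fit disjoint feature blocks to the running residual, one stage at a time.

    X: (n, d) features, Y: (n, k) targets (e.g. one-hot labels).
    Returns the stacked weights W of shape (d, k) and the final residual.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    residual = Y.astype(float).copy()
    for start in range(0, d, block_size):
        cols = slice(start, min(start + block_size, d))
        Xb = X[:, cols]
        # Small Gram matrix for this block only (at most block_size wide).
        gram = Xb.T @ Xb / n + ridge * np.eye(Xb.shape[1])
        # Least-squares fit of the block to what earlier stages left unexplained.
        Wb = np.linalg.solve(gram, Xb.T @ residual / n)
        W[cols] = Wb
        residual -= Xb @ Wb
    return W, residual
```

Each stage solves a system whose size is the block width rather than the full dimension $d$, which is where the reduced per-iteration cost comes from in this simplified setting.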


This review was created by AI and reviewed by human editors.