QUICK REVIEW

[Paper Review] Distributed Coordinate Descent Method for Learning with Big Data

Peter Richtárik, Martin Takáč|arXiv (Cornell University)|Oct 8, 2013

Stochastic Gradient Optimization Techniques|14 references|58 citations

TL;DR

This paper introduces Hydra, a distributed coordinate descent method for large-scale learning problems that partitions features across cluster nodes and updates random subsets of coordinates in parallel. It gives convergence bounds that depend on two data-dependent quantities (σ and σ′), shows how the speedup from increasing τ depends on the data and on partition quality, and validates the method on a 3TB LASSO problem, where optimized communication protocols make iterations up to about 3× faster than the basic protocol.

ABSTRACT

In this paper we develop and analyze Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data. We initially partition the coordinates (features) and assign each partition to a different node of a cluster. At every iteration, each node picks a random subset of the coordinates from those it owns, independently from the other computers, and in parallel computes and applies updates to the selected coordinates based on a simple closed-form formula. We give bounds on the number of iterations sufficient to approximately solve the problem with high probability, and show how it depends on the data and on the partitioning. We perform numerical experiments with a LASSO instance described by a 3TB matrix.

Motivation & Objective

To address the scalability challenge of coordinate descent methods in big data scenarios where data cannot fit on a single machine.
To design a distributed coordinate descent algorithm that leverages both inter-node and intra-node parallelism for efficient large-scale optimization.
To provide theoretical convergence guarantees for the method under general smooth and regularized loss functions.
To analyze how the method's performance depends on data structure (spectral norm σ) and partitioning (σ′), enabling practitioners to predict scalability.

Proposed method

The method partitions d features into c equal-sized blocks and assigns each to a different node in a cluster, enabling distributed storage and local computation.
At each iteration, each node independently selects τ random coordinates from its assigned partition and updates them using a closed-form formula based on partial derivatives.
The algorithm uses a hybrid parallelism model: parallel updates within each node and coordination across nodes via lightweight communication.
It introduces two key data-dependent quantities: σ (spectral norm of the data matrix) and σ′ (partition-induced norm), which determine convergence speed and scalability.
The communication protocol is optimized using asynchronous ring-based messaging (ASL) to reduce latency and improve throughput compared to traditional reduce-all operations.
The method supports both fully parallel (FP) and alternating parallel/serial (PS) communication strategies to balance computation and communication overhead.
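
The steps above can be sketched for the LASSO case as a single-process simulation. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`make_partitions`, `hydra_like_step`, and so on) are hypothetical, the "nodes" are simulated by a plain loop, the objective is assumed to be 0.5·‖Ax − b‖² + λ‖x‖₁, and the stepsize parameter β is set to the conservative value c·τ (the total number of coordinates updated per iteration) rather than the smaller data-dependent value the paper derives from σ and σ′. Communication protocols are not modeled.

```python
import random


def soft_threshold(v, t):
    """Closed-form minimizer of 0.5*(z - v)**2 + t*|z| over z."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_objective(A, b, x, lam):
    """0.5*||Ax - b||^2 + lam*||x||_1 for a dense row-major matrix A."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xi) for xi in x)


def make_partitions(d, c):
    """Split coordinates 0..d-1 into c equal-sized contiguous blocks (assumes c divides d)."""
    if d % c != 0:
        raise ValueError("this sketch assumes c divides d")
    size = d // c
    return [list(range(k * size, (k + 1) * size)) for k in range(c)]


def hydra_like_step(A, b, x, lam, partitions, tau, beta, rng):
    """One simulated iteration: every node updates tau of its own coordinates.

    All updates are computed from the same iterate x and applied together,
    which mimics nodes working in parallel within one iteration.
    """
    n = len(A)
    residual = [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]
    new_x = list(x)
    for owned in partitions:                      # one pass per simulated node
        for i in rng.sample(owned, tau):          # tau random coordinates it owns
            column = [A[j][i] for j in range(n)]
            m_ii = sum(v * v for v in column)     # squared norm of column i
            if m_ii == 0.0:
                continue
            grad_i = sum(column[j] * residual[j] for j in range(n))  # partial derivative
            scale = beta * m_ii
            new_x[i] = soft_threshold(x[i] - grad_i / scale, lam / scale)
    return new_x


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, c, tau, lam = 20, 12, 3, 2, 0.01
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.0] * d
    parts = make_partitions(d, c)
    for _ in range(50):
        x = hydra_like_step(A, b, x, lam, parts, tau, beta=c * tau, rng=rng)
    print(lasso_objective(A, b, x, lam))
```

With β = c·τ the objective cannot increase from one iteration to the next, which makes the sketch easy to sanity-check; the price is a shorter step than a data-dependent β would allow.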

Experimental results

Research questions

RQ1: How does the convergence rate of distributed coordinate descent depend on the data structure and partitioning strategy?
RQ2: Can a distributed coordinate descent method achieve near-linear speedup with increasing parallelism (τ) in big data settings?
RQ3: What are the theoretical bounds on the number of iterations required to achieve ϵ-accuracy with high probability in a distributed setting?
RQ4: How do data-dependent quantities σ and σ′ affect the scalability and performance of the method?
RQ5: Can optimized communication protocols like ASL significantly reduce iteration time without compromising convergence?

Key findings

Hydra achieves up to 3.11× speedup over the basic RA-PS communication protocol when τ = 10², demonstrating significant performance gains with optimized communication.
The convergence rate depends on two data-dependent quantities: σ (spectral norm) and σ′ (partition-induced norm), which can be estimated a-priori to predict scalability.
For strongly convex losses, Hydra converges to an ϵ-accurate solution with probability at least 1−ρ in O((dβ/(cτμ)) log(1/(ϵρ))) iterations, where β is the stepsize and μ is the strong convexity constant.
The ASL-FP protocol reduces average iteration time to 0.025 seconds (vs. 0.040s for RA-PS), achieving 1.62× speedup at τ = 10 and 3.11× at τ = 10².
The method successfully solved a 3TB LASSO problem in under 30 minutes, reducing the loss by 25 orders of magnitude, demonstrating practical scalability on real-world big data.
Theoretical analysis shows that if σ is small, increasing τ leads to nearly linear speedup; if σ is large, speedup may be negligible, indicating that σ is a key predictor of parallel efficiency.
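
To make the dependence on τ and β in the iteration bound concrete, the leading term (dβ/(cτμ)) log(1/(ϵρ)) can be evaluated directly. The sketch below is only a back-of-the-envelope helper: `iteration_scale` is a hypothetical name, the constant hidden by the O(·) is ignored, and the input values are made up for illustration. It contrasts a case where β stays fixed as τ grows (the small-σ regime described above) with a case where β grows in proportion to τ (the large-σ regime).

```python
import math


def iteration_scale(d, c, tau, beta, mu, eps, rho):
    """Leading term of the bound: (d*beta)/(c*tau*mu) * log(1/(eps*rho)).

    The constant hidden by O(.) is not included, so only ratios between
    two calls are meaningful, not the absolute numbers.
    """
    return (d * beta) / (c * tau * mu) * math.log(1.0 / (eps * rho))


if __name__ == "__main__":
    # Illustrative (made-up) problem sizes and accuracy targets.
    common = dict(d=10**6, c=10, mu=1.0, eps=1e-6, rho=0.01)

    base = iteration_scale(tau=1, beta=1.0, **common)
    # beta unchanged while tau grows 10x: the term shrinks 10x.
    favourable = iteration_scale(tau=10, beta=1.0, **common)
    # beta grows 10x along with tau: the term does not shrink at all.
    unfavourable = iteration_scale(tau=10, beta=10.0, **common)

    print(base / favourable, base / unfavourable)
```

Because τ appears in the denominator and β in the numerator, the benefit of a larger τ is exactly the factor τ/β relative to the τ = 1, β = 1 baseline, which is why the growth of β with τ decides whether the speedup is near-linear or negligible.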

This review was created by AI and reviewed by human editors.