QUICK REVIEW

[Paper Review] Online normalizer calculation for softmax

Maxim Milakov, Natalia Gimelshein|arXiv (Cornell University)|May 8, 2018
Parallel Computing and Optimization Techniques | 8 references | 20 citations
TL;DR

This paper proposes an online normalizer calculation for the Softmax function that reduces memory accesses from four to three per element by computing the maximum and normalization term in a single pass, enabling faster inference. Benchmarks on Tesla V100 show up to 1.3x speedup for Softmax alone and up to 5x for fused Softmax+TopK, with performance gains driven by reduced memory bandwidth pressure.

ABSTRACT

The Softmax function is ubiquitous in machine learning, multiple previous works suggested faster alternatives for it. In this paper we propose a way to compute classical Softmax with fewer memory accesses and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x and Softmax+TopK combined and fused by up to 5x.

Motivation & Objective

  • To reduce memory access overhead in Softmax computation, which is a performance bottleneck in deep learning inference.
  • To address the lack of targeted optimization for the classical Softmax function despite numerous alternatives.
  • To enable efficient fusion of Softmax and TopK operations by co-locating normalization and selection logic.
  • To preserve the numerical stability of the safe (max-subtracted) Softmax while reducing memory bandwidth usage by computing the maximum and normalizer in a single pass.
  • To demonstrate measurable performance gains on modern hardware, particularly GPUs.

Proposed method

  • Introduces an algorithm that computes both the maximum value and the Softmax normalization term in a single pass over the input, so the full Softmax needs two passes instead of the three used by the safe version, reducing memory accesses from four to three per element.
  • Uses a numerically stable formulation by subtracting the maximum value from all logits before exponentiation to prevent overflow/underflow.
  • Employs an incremental update rule for the normalization term: $ d_j = d_{j-1} \cdot e^{m_{j-1} - m_j} + e^{x_j - m_j} $, where $ m_j $ is the running maximum.
  • Maintains both the running maximum $ m_j = \max(m_{j-1}, x_j) $ and the running normalizer $ d_j $, updating them incrementally as each element is processed.
  • Fuses Softmax with TopK operations by tracking top-k values during the same pass, eliminating redundant memory accesses.
  • Optimizes for GPU performance by minimizing memory bandwidth usage and enabling kernel fusion.
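
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's CUDA implementation: the function names `online_softmax` and `safe_softmax` are hypothetical, inputs are assumed to be finite floats, and the loop is sequential rather than parallel. The loop keeps the invariant $ d_j = \sum_{i \le j} e^{x_i - m_j} $, which is why rescaling the old sum by $ e^{m_{j-1} - m_j} $ is enough whenever the maximum changes.

```python
import math

def online_softmax(x):
    """Softmax with the max and normalizer computed in one pass (sketch)."""
    m = float("-inf")  # running maximum m_j
    d = 0.0            # running normalizer d_j
    for xj in x:       # pass 1: one load per element
        m_new = max(m, xj)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(xj - m_new)
        m = m_new
    # pass 2: one load and one store per element
    return [math.exp(xj - m) / d for xj in x]

def safe_softmax(x):
    """Three-pass reference: max, then sum, then normalize."""
    m = max(x)
    d = sum(math.exp(v - m) for v in x)
    return [math.exp(v - m) / d for v in x]
```

Both functions should agree up to floating-point rounding, including on large logits such as `[1000.0, 1001.0]`, where exponentiating without subtracting the maximum would overflow.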

Experimental results

Research questions

  • RQ1: Can reducing memory accesses in Softmax computation lead to measurable performance improvements on modern hardware?
  • RQ2: Is it possible to compute the Softmax normalizer and maximum value in a single pass without sacrificing numerical stability?
  • RQ3: How does the proposed online normalizer compare to standard two- or three-pass implementations in terms of performance and accuracy?
  • RQ4: To what extent can the fusion of Softmax and TopK operations improve end-to-end inference speed?
  • RQ5: Does the performance gain scale with vector size and batch size on GPU architectures?

Key findings

  • The proposed online normalizer reduces memory accesses from four to three per element, achieving up to 1.3x speedup for Softmax alone on Tesla V100 with large vector sizes.
  • When fused with TopK, the combined Softmax+TopK operation achieves up to 5x speedup over the unfused safe version, the product of about 2.5x from fusion and 2x from the online normalizer.
  • The performance improvement is most pronounced in large-batch scenarios, where memory bandwidth becomes the limiting factor.
  • Even in small-batch settings, the online Softmax achieves 1.5x–2.5x speedup due to reduced latency and memory access overhead.
  • The method maintains numerical stability and is compatible with existing deep learning frameworks, offering a drop-in optimization.
  • The performance gains are orthogonal to other Softmax optimization techniques like Hierarchical Softmax, SVD-Softmax, and Importance Sampling, allowing for further acceleration when combined.
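
The fused Softmax+TopK result can be illustrated with a small sequential sketch. It assumes the fusion works by tracking the k largest logits in the same loop that updates the running maximum and normalizer, so the input is read once and only k probabilities are produced. The function name `online_softmax_topk` and the sort-based buffer are illustrative choices, not the paper's GPU kernel.

```python
import math

def online_softmax_topk(x, k):
    """Return [(index, probability)] for the k largest logits in one pass (sketch)."""
    if k <= 0:
        return []
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer
    top = []           # (logit, index) pairs, descending by logit, at most k entries
    for i, xj in enumerate(x):
        m_new = max(m, xj)
        d = d * math.exp(m - m_new) + math.exp(xj - m_new)
        m = m_new
        # Keep the candidate only if it belongs among the current top k.
        if len(top) < k or xj > top[-1][0]:
            top.append((xj, i))
            top.sort(key=lambda t: -t[0])  # k is small, so sorting is cheap
            del top[k:]
    # Only the k selected logits are exponentiated and normalized at the end.
    return [(i, math.exp(v - m) / d) for v, i in top]
```

For example, `online_softmax_topk([1.0, 3.0, 2.0, 0.5], 2)` selects indices 1 and 2, with probabilities equal to those a full Softmax would assign to them.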


This review was created by AI and reviewed by human editors.