QUICK REVIEW

[Paper Review] An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family

Alexandre de Brébisson, Pascal Vincent|arXiv (Cornell University)|Nov 16, 2015

Stock Market Forecasting Methods | 14 references | 52 citations

TL;DR

This paper investigates softmax alternatives within the spherical loss family, specifically the log-spherical softmax and a novel log-Taylor softmax. These alternatives outperform the standard log-softmax on low-dimensional classification tasks such as MNIST and CIFAR-10, but underperform it on high-dimensional language modeling benchmarks such as One Billion Word. Losses in the spherical family admit $O(d^2)$ output-weight updates that are independent of the output size, which makes them a scalable alternative to the standard softmax.

ABSTRACT

In a multi-class classification problem, it is standard to model the output of a neural network as a categorical distribution conditioned on the inputs. The output must therefore be positive and sum to one, which is traditionally enforced by a softmax. This probabilistic mapping allows to use the maximum likelihood principle, which leads to the well-known log-softmax loss. However the choice of the softmax function seems somehow arbitrary as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives. In particular Vincent et al. (2015) recently introduced a class of loss functions, called the spherical family, for which there exists an efficient algorithm to compute the updates of the output weights irrespective of the output size. In this paper, we explore several loss functions from this family as possible alternatives to the traditional log-softmax. In particular, we focus our investigation on spherical bounds of the log-softmax loss and on two spherical log-likelihood losses, namely the log-Spherical Softmax suggested by Vincent et al. (2015) and the log-Taylor Softmax that we introduce. Although these alternatives do not yield as good results as the log-softmax loss on two language modeling tasks, they surprisingly outperform it in our experiments on MNIST and CIFAR-10, suggesting that they might be relevant in a broad range of applications.

Motivation & Objective

To evaluate whether softmax alternatives from the spherical loss family can outperform standard log-softmax in multi-class classification.
To investigate the empirical performance of spherical losses, including log-spherical softmax and a newly proposed log-Taylor softmax, across diverse datasets.
To understand why log-softmax dominates in high-dimensional settings like language modeling, while spherical losses perform better in low-dimensional tasks.
To analyze the trade-offs between training efficiency, model capacity, and generalization across different loss functions.

Proposed method

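The spherical loss family comprises losses that depend on the output activations $\mathbf{o}$ only through the target-class activation $o_c$, the sum $s = \sum_i o_i$, and the squared norm $q = \|\mathbf{o}\|^2$. This enables output-weight updates in $O(d^2)$ instead of $O(dD)$, where $d$ is the size of the last hidden layer and $D$ is the number of output classes.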
Spherical upper bounds of the log-softmax loss are derived using convex analysis, providing alternative surrogate losses that maintain the same minimum.
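The log-Taylor softmax is introduced as a spherical loss that replaces the exponential in the softmax with its second-order Taylor expansion, $1 + o_i + \tfrac{1}{2} o_i^2$. Because this term is always strictly positive, no stabilizing hyperparameter $\epsilon$ is needed.
The log-spherical softmax is adopted from prior work (Vincent et al., 2015). It replaces the exponential with a square and adds a small constant $\epsilon$ for numerical stability, normalizing $o_c^2 + \epsilon$ by $\sum_i (o_i^2 + \epsilon)$, so it depends on the outputs only through $o_c$ and $q$.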
Experiments compare these losses on MNIST, CIFAR-10, CIFAR-100, and language modeling tasks using fixed architectures to isolate loss function effects.
Model depth, non-linearities (e.g., ReLU, exponential), and batch normalization are varied to assess their impact on spherical loss performance.
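
The three log-likelihood losses discussed above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the usual definitions of the spherical softmax (squared outputs plus a small constant) and the Taylor softmax (second-order expansion of the exponential). The function names and the default `eps=1e-3` are assumptions made for this example, not values taken from the paper. The point of the sketch is that both spherical losses need only $o_c$, $s$, $q$, and the output size $D$.

```python
import numpy as np

def log_softmax_loss(o, c):
    # Standard negative log-likelihood under softmax, with a stable log-sum-exp.
    m = np.max(o)
    return -(o[c] - m - np.log(np.sum(np.exp(o - m))))

def log_spherical_softmax_loss(o, c, eps=1e-3):
    # p_c = (o_c^2 + eps) / sum_i (o_i^2 + eps) = (o_c^2 + eps) / (q + D * eps)
    # eps is a small stabilizing constant; 1e-3 is an assumed value.
    D = o.shape[0]
    q = np.dot(o, o)  # squared norm of the outputs
    return -np.log((o[c] ** 2 + eps) / (q + D * eps))

def taylor_softmax(o):
    # Second-order Taylor expansion of exp: 1 + o + o^2 / 2, always >= 0.5.
    num = 1.0 + o + 0.5 * o ** 2
    return num / np.sum(num)

def log_taylor_softmax_loss(o, c):
    # p_c = (1 + o_c + o_c^2 / 2) / (D + s + q / 2): depends only on o_c, s, q.
    D = o.shape[0]
    s = np.sum(o)     # sum of the outputs
    q = np.dot(o, o)  # squared norm of the outputs
    return -np.log((1.0 + o[c] + 0.5 * o[c] ** 2) / (D + s + 0.5 * q))

if __name__ == "__main__":
    o = np.array([1.0, -2.0, 0.5])  # toy output activations
    for name, loss in [("log-softmax", log_softmax_loss),
                       ("log-spherical softmax", log_spherical_softmax_loss),
                       ("log-Taylor softmax", log_taylor_softmax_loss)]:
        print(name, loss(o, 0))
```

Writing the normalizers as `q + D * eps` and `D + s + 0.5 * q` makes the dependence on the summary statistics explicit, which is the property the efficient update algorithm relies on.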

Experimental results

Research questions

RQ1: Do spherical loss-based alternatives to log-softmax achieve better generalization than standard log-softmax on low-dimensional classification tasks?
RQ2: Why does log-softmax outperform spherical losses in high-dimensional language modeling tasks despite their efficiency?
RQ3: How do the hyperparameters and numerical stability of spherical softmax compare to the proposed log-Taylor softmax?
RQ4: Can architectural modifications such as deeper networks or stronger non-linearities improve the performance of spherical losses?
RQ5: What is the role of the exponential non-linearity in softmax for discriminative feature competition in large output spaces?

Key findings

On MNIST and CIFAR-10, the log-Taylor softmax and log-spherical softmax outperform log-softmax, achieving lower test error with fixed architectures.
On the One Billion Word dataset, log-softmax achieves a perplexity of 19.2 with two hidden layers, while log-spherical softmax reaches 28.4 and log-Taylor softmax 28.9, indicating a significant performance gap.
The SimLex-999 score for log-softmax improves with depth (0.318 with two layers), while spherical losses show only modest gains (0.262–0.265), suggesting limited capacity for semantic similarity modeling.
The log-Taylor softmax outperforms the log-spherical softmax in both accuracy and stability: it does not require the stabilizing hyperparameter $\epsilon$, and it exhibits a small asymmetry that may aid learning.
Despite architectural enhancements like deeper networks, ReLU replacements with exponentials, and batch normalization, spherical losses did not surpass log-softmax on high-dimensional tasks.
The qualitative shift in performance, where spherical losses outperform log-softmax in low dimensions but underperform in high dimensions, remains unexplained, suggesting a fundamental difference in inductive bias.

This review was created by AI and reviewed by human editors.