[Paper Review] Fix your classifier: the marginal value of training the last weight layer
The paper shows that fixing the final linear classifier in CNNs to a predefined orthogonal transform (such as a Hadamard matrix) yields comparable accuracy while reducing the number of trainable parameters and potentially speeding up inference.
Neural networks are commonly used as models for classification for a wide variety of tasks. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more resources. In this work we argue that this classifier can be fixed, up to a global scale constant, with little or no loss of accuracy for most tasks, allowing memory and computational benefits. Moreover, we show that by initializing the classifier with a Hadamard matrix we can speed up inference as well. We discuss the implications for current understanding of neural network models.
Motivation & Objective
- Reduce the parameter count of the final classification layer of CNNs without sacrificing accuracy.
- Propose fixed linear transforms (orthogonal, Hadamard) as the final classifier and study training dynamics.
- Evaluate fixed classifiers on CIFAR-10/100, ImageNet, and language modeling to assess generality.
- Analyze practical implications for large-scale datasets and deployment on memory- and compute-constrained devices.
Proposed method
- Replace the trainable W in the final affine classifier with a fixed orthonormal projection Q (columns q_i orthogonal, unit norm).
- Normalize the final representation x to unit L2 norm (x̂ = x / ‖x‖₂), and introduce a single learned scalar α to scale the softmax inputs, plus biases b. The logit for class i is z_i = α q_i·x̂ + b_i, and the class probabilities are softmax(z).
- Optionally use a fixed Hadamard matrix Ĥ (C×N) with ±1 entries as the final classifier (y = Ĥ x̂ + b) to avoid storing coefficients and simplify computation.
- Explore a cosine-angle loss as an alternative to softmax.
- Experiment on CIFAR-10/100 and ImageNet with various architectures (ResNet, DenseNet, ShuffleNet), and on language modeling with WikiText-2, to compare learned vs fixed classifiers.
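The forward pass described in the bullets above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names (`hadamard`, `fixed_classifier`, and the helpers) are hypothetical, the Hadamard matrix is built with the Sylvester construction (which assumes the feature dimension N is a power of two), and dividing by sqrt(N) to give each row unit norm is an assumption made here so that the rows match the "unit norm" condition on q_i.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size and keeps rows mutually orthogonal.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def l2_normalize(x, eps=1e-12):
    """Return x scaled to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fixed_classifier(x, num_classes, alpha, bias):
    """Class probabilities from a fixed, truncated Hadamard classifier.

    Only `alpha` and `bias` would be trainable; the +1/-1 matrix is never stored
    as learned weights. The 1/sqrt(N) row scaling is an assumption of this sketch.
    """
    n = len(x)
    if num_classes > n or len(bias) != num_classes:
        raise ValueError("need num_classes <= len(x) and one bias per class")
    rows = hadamard(n)[:num_classes]          # truncated C x N matrix
    x_hat = l2_normalize(x)
    scale = alpha / math.sqrt(n)              # each row of H has norm sqrt(N)
    logits = [scale * sum(h * v for h, v in zip(row, x_hat)) + b
              for row, b in zip(rows, bias)]
    return softmax(logits)

# Hypothetical 8-dimensional feature vector and 4 classes.
features = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
probs = fixed_classifier(features, num_classes=4, alpha=10.0, bias=[0.0] * 4)
```

Because the features are normalized before the projection, rescaling `features` leaves `probs` essentially unchanged; the sharpness of the output distribution is controlled by `alpha` alone, which is why a single learned scale is enough to compensate for the fixed weights.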
Experimental results
Research questions
- RQ1: Can a fixed final classifier maintain comparable accuracy to a learned classifier across common CNN tasks?
- RQ2: What is the impact of fixing the classifier on training dynamics, parameter count, and memory usage?
- RQ3: Does a Hadamard or orthogonal fixed transform provide computational/memory benefits without sacrificing performance?
- RQ4: Are there domains (e.g., language modeling) where fixed classifiers are less effective due to class correlations or embedding roles?
Key findings
- Fixed classifiers achieve nearly identical validation accuracy to learned classifiers on CIFAR-10/100 and ImageNet across multiple architectures.
- Fixing the final layer removes its weights from the trainable parameter count. The share of the model's parameters that this removes ranges from negligible to more than half, depending on the model and the number of classes: 0.07% for CIFAR-10 ResNet56; 4.2% for CIFAR-100 DenseNet; 8.01% for ImageNet ResNet50; 11.76% for ImageNet DenseNet169; 52.56% for ShuffleNet on ImageNet.
- Using a fixed Hadamard matrix as the final classifier provides memory benefits and allows full parameter removal from the final layer in certain configurations without loss in accuracy.
- In language modeling (WikiText-2), fixed random orthogonal embeddings perform poorly compared to learned embeddings, but pre-trained word2vec embeddings with a fixed transform reduce parameters by ~89% and yield only modest perplexity degradation.
- Across ImageNet and CIFAR tasks, the fixed classifier converges with similar training/validation behavior, and a single scale parameter α can be learned to match performance.
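The parameter-share figures in the findings above can be sanity-checked with simple arithmetic. The sketch below is a hedged illustration only: the helper `final_layer_share` is hypothetical, and the ResNet50 numbers used (a 2048-dimensional final feature, 1000 ImageNet classes, and roughly 25.56M total parameters) are commonly cited values assumed here, not figures stated in this review. It also assumes the bias terms stay trainable and only the weight matrix is fixed.

```python
def final_layer_share(feature_dim, num_classes, total_params):
    """Fraction of all model parameters held by the final weight matrix (bias excluded)."""
    return feature_dim * num_classes / total_params

# Assumed ResNet50 figures: 2048-d features, 1000 classes, 25,557,032 parameters in total.
share = final_layer_share(2048, 1000, 25_557_032)
print(f"{share:.2%}")  # prints 8.01%
```

Under these assumptions the result agrees with the 8.01% listed for ImageNet ResNet50, which suggests the reported percentages count the final weight matrix relative to the whole model.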
This review was created by AI and reviewed by human editors.