QUICK REVIEW

[Paper Review] Comparison of non-linear activation functions for deep neural networks on MNIST classification task

Dabal Pedamonti | arXiv (Cornell University) | Apr 8, 2018

Neural Networks and Applications | 2 references | 120 citations

TL;DR

The paper compares Leaky ReLU, ELU, and SELU against ReLU and sigmoid on MNIST, analyzes network depth up to 8 layers, and evaluates various weight initialization schemes and learning rates to assess performance and generalization.

ABSTRACT

Activation functions play a key role in neural networks, so it is fundamental to understand their advantages and disadvantages in order to achieve better performance. This paper first introduces common types of non-linear activation functions that are alternatives to the well-known sigmoid function and then evaluates their characteristics. Moreover, deeper neural networks are analysed because they positively influence final performance compared to shallower networks. They also depend strictly on the weight initialisation; hence the effect of drawing weights from Gaussian and uniform distributions is analysed, paying particular attention to how the number of incoming and outgoing connections to a node influences the whole network.

Motivation & Objective

Evaluate and compare how different non-linear activation functions (Leaky ReLU, ELU, SELU) perform on MNIST classification relative to sigmoid and ReLU baselines.
Investigate the impact of network depth (up to 8 hidden layers) on accuracy and loss under different weight initialization schemes.
Assess how initialization strategies (Glorot uniform/Gaussian, fan_in, fan_out) and learning rates affect training dynamics and generalization.

Proposed method

Describe and analyze activation functions (ReLU variants) and their gradients.
Conduct MNIST experiments with two hidden layers (100 units each) to compare activations.
Vary learning rate across 0.01, 0.05, 0.1, 0.2 and observe loss/accuracy on training and validation sets.
Evaluate deeper networks with ELU (and comparisons with SELU) using different weight initializations: uniform, fan_in, fan_out, Gaussian.
Record validation accuracy and loss as depth increases up to 8 hidden layers.
Compare initialization methods (Glorot uniform, fan_in, fan_out, Gaussian) and report effects on accuracy and loss.
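The activation functions named above can be written out as a minimal sketch. The formulas are the standard textbook definitions, not code taken from the paper; the Leaky ReLU slope of 0.01 and the ELU alpha of 1.0 are common defaults assumed here, and the SELU constants are the usual published values rather than values confirmed by this review.

```python
import math

# Standard SELU constants; assumed here, not confirmed by the review.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # slope=0.01 is an assumed default for the negative side.
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for large negative inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    # SELU is a scaled ELU with fixed alpha and scale.
    return SELU_LAMBDA * elu(x, SELU_ALPHA)


# Gradients with respect to the input x.
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)


def selu_grad(x):
    return SELU_LAMBDA * elu_grad(x, SELU_ALPHA)
```

Unlike ReLU, the Leaky ReLU, ELU, and SELU gradients stay non-zero for negative inputs, which is the property usually cited as the reason to compare these variants.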

Experimental results

Research questions

RQ1: Which activation functions (Leaky ReLU, ELU, SELU) yield the best accuracy and lowest loss on MNIST compared to sigmoid and ReLU baselines?
RQ2: How does increasing network depth influence MNIST performance for ELU and SELU activations under different weight initialization schemes?
RQ3: What is the impact of weight initialization (Glorot uniform/Gaussian, fan_in, fan_out) on training dynamics and final accuracy for ELU/SELU networks?
RQ4: How do learning rate choices (e.g., 0.05 vs 0.1) affect validation performance and overfitting for these activations?

Key findings

ELU generally provides better loss and accuracy than Leaky ReLU and ReLU across tested runs on MNIST.
ELU outperforms SELU in the majority of experiments, though SELU can occasionally rival ELU at certain learning rates (e.g., 0.05).
ReLU and its variants consistently outperform sigmoid on the MNIST task.
Deeper networks with ELU can reach validation accuracy up to 0.983 (7 hidden layers with Glorot uniform initialization).
Weight initialization significantly affects final accuracy and loss; Glorot uniform often yields better average accuracy, with deeper networks increasing performance but also training time.
Gaussian weight initialization generally provides more stable validation losses and higher accuracies than uniform in SELU scenarios.
As depth increases, accuracy tends to improve and training time increases, highlighting a trade-off between performance and computation.
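The initialization schemes compared in these findings can be illustrated with a short sketch. The Glorot uniform and Glorot Gaussian formulas below are the standard ones; the fan_in and fan_out variants are assumed to be uniform draws with variance 1/fan_in and 1/fan_out respectively, which the review does not confirm. The function names and the 784-to-100 layer shape are hypothetical and chosen only for illustration.

```python
import math
import random


def _uniform_matrix(fan_in, fan_out, limit, rng):
    # fan_in x fan_out matrix with entries drawn from U(-limit, limit).
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def glorot_uniform(fan_in, fan_out, rng):
    # Standard Glorot uniform: variance 2 / (fan_in + fan_out).
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return _uniform_matrix(fan_in, fan_out, limit, rng)


def glorot_gaussian(fan_in, fan_out, rng):
    # Standard Glorot Gaussian: zero mean, same variance as above.
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


def fan_in_uniform(fan_in, fan_out, rng):
    # Assumption: variance 1 / fan_in, so the limit is sqrt(3 / fan_in).
    return _uniform_matrix(fan_in, fan_out, math.sqrt(3.0 / fan_in), rng)


def fan_out_uniform(fan_in, fan_out, rng):
    # Assumption: variance 1 / fan_out, so the limit is sqrt(3 / fan_out).
    return _uniform_matrix(fan_in, fan_out, math.sqrt(3.0 / fan_out), rng)


# Example: first-layer weights for a 784-input, 100-unit hidden layer.
rng = random.Random(0)
w_first = glorot_uniform(784, 100, rng)
```

Glorot initialization balances the incoming and outgoing connection counts, while the fan_in and fan_out variants each account for only one side.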

This review was created by AI and reviewed by human editors.