QUICK REVIEW

[Paper Review] FitNets: Hints for Thin Deep Nets

Adriana Romero, Nicolas Ballas | arXiv (Cornell University) | Dec 19, 2014

Advanced Neural Network Applications | 27 references | 2,032 citations

TL;DR

This paper extends knowledge distillation by training deeper, thinner student networks (FitNets) using intermediate hints from a teacher network, enabling high accuracy with far fewer parameters and faster inference.

ABSTRACT

While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.

Motivation & Objective

Motivate compression of wide, deep networks for efficiency in memory and computation.
Introduce a method to train thin and deep student networks using teacher-derived hints.
Leverage knowledge distillation combined with intermediate representations to guide training.
Demonstrate that deeper, thinner models can match or exceed teacher performance on standard benchmarks.
Show practical stage-wise training and curriculum-learning perspectives for better optimization.

Proposed method

Review Knowledge Distillation (KD), in which a student mimics the teacher's softened outputs; the degree of softening is set by a temperature parameter tau.
Introduce hint-based training where a hidden layer (the hint) from the teacher guides a corresponding (guided) hidden layer in the student via a regressor when dimensions differ.
Use a convolutional regressor to map the student guided layer to the teacher hint layer, which needs far fewer parameters than a fully connected regressor.
Describe a stage-wise training procedure: first train up to the guided layer using hints, then train the full FitNet with KD loss.
Formulate two losses: L_KD, which combines the standard cross-entropy on the true labels with a cross-entropy against the softened teacher output, weighted by lambda; and L_HT, the squared L2 distance between the teacher hint and the regressor's output on the student guided layer.
Discuss relation to Curriculum Learning, where the teacher’s confidence acts as a curriculum signal and lambda is annealed during training.
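
The two losses described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the default values of tau and lam are placeholders, the regressor is a plain linear map standing in for the convolutional regressor, and only the forward loss values are computed (no gradients or training loop).

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Row-wise softmax of logits / tau; tau > 1 softens the distribution."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    """Batch mean of H(p, q) = -sum_k p_k * log(q_k)."""
    return float(-(p * np.log(q + eps)).sum(axis=1).mean())

def kd_loss(student_logits, teacher_logits, labels, tau=3.0, lam=4.0):
    """L_KD = H(y_true, P_S) + lam * H(P_T^tau, P_S^tau).

    tau and lam defaults are illustrative placeholders.
    """
    n_classes = student_logits.shape[1]
    y_true = np.eye(n_classes)[labels]  # one-hot encode the true labels
    hard = cross_entropy(y_true, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits, tau),
                         softmax(student_logits, tau))
    return hard + lam * soft

def hint_loss(teacher_hint, student_guided, w_r):
    """L_HT = 1/2 * ||hint - r(guided)||^2, averaged over the batch.

    Assumption: r is a linear map (guided @ w_r), not a convolution.
    """
    pred = student_guided @ w_r
    return float(0.5 * ((teacher_hint - pred) ** 2).sum(axis=1).mean())

# Toy usage with random data (shapes are arbitrary).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 10))
teacher_logits = rng.normal(size=(4, 10))
labels = np.array([1, 3, 5, 7])
print("L_KD:", kd_loss(student_logits, teacher_logits, labels))

student_guided = rng.normal(size=(4, 8))   # thinner student layer
teacher_hint = rng.normal(size=(4, 16))    # wider teacher layer
w_r = rng.normal(size=(8, 16))             # regressor weights
print("L_HT:", hint_loss(teacher_hint, student_guided, w_r))
```

In the stage-wise procedure, a loss like hint_loss would drive the first stage (student layers up to the guided layer, plus the regressor), and a loss like kd_loss would drive the second stage over the whole student.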

Experimental results

Research questions

RQ1: Can deeper, thinner student networks be effectively trained by exploiting intermediate teacher representations as hints?
RQ2: Does hint-based training plus KD outperform standard backpropagation and pure KD in training deep, thin networks?
RQ3: What is the trade-off between model depth, parameter count, and inference efficiency when using FitNets?
RQ4: How well do FitNets generalize across standard vision benchmarks compared to their teachers and other compression methods?

Key findings

Deep, thin student networks can outperform their teacher while using far fewer parameters and computations.
Hint-based training (HT) enables training of networks with greater depth than KD alone, yielding better generalization.
On CIFAR-10, the smallest FitNet (11 layers, about 250K parameters) achieves 89.01% accuracy, slightly below the teacher's 90.18%, in exchange for substantial speedup and compression.
On CIFAR-10, larger FitNets (11 to 19 layers) surpass the teacher: the largest reaches 91.61% accuracy with about 2.5M parameters, against the teacher's ~9M.
On CIFAR-100, FitNets again outperform teachers, with strong parameter reductions (about 3x fewer) and competitive accuracy.
On SVHN, FitNets with ~30K–1.5M parameters reach competitive error rates close to or better than the teacher, while using a fraction of the parameters.
MNIST tests show that HT plus KD yields substantial gains, with a FitNet achieving 0.51% misclassification error using 12x fewer parameters than the teacher.
AFLW experiments indicate that hints provide noticeable improvements in thinner architectures, with HT outperforming KD in several cases.
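
As a quick sanity check, the CIFAR-10 parameter counts quoted above translate into compression ratios as follows. The figures are the rounded values from this review ("about 250K", "about 2.5M", "~9M"), so the ratios are approximate; the variable and function names are illustrative only.

```python
# Rounded parameter counts as quoted in the findings above.
teacher_params = 9_000_000   # "~9M" teacher
fitnet_small = 250_000       # "about 250K" 11-layer FitNet
fitnet_large = 2_500_000     # "about 2.5M" largest FitNet

def compression(teacher, student):
    """Ratio of teacher to student parameter counts."""
    return teacher / student

print(compression(teacher_params, fitnet_small))  # 36.0
print(compression(teacher_params, fitnet_large))  # 3.6
```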

This review was created by AI and reviewed by human editors.