QUICK REVIEW

[Paper Review] Ensemble Knowledge Distillation for Learning Improved and Efficient Networks

Umar Asif, Jianbin Tang|arXiv (Cornell University)|Sep 17, 2019

Advanced Neural Network Applications | Cited by 27

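One-Sentence Summary

This paper proposes Ensemble Knowledge Distillation (EKD), a framework that trains a compact, multi-branch student convolutional neural network (CNN) by distilling knowledge from multiple high-capacity teacher networks. By learning diverse feature representations through ensemble distillation and by ensembling the branch outputs, EKD improves generalization and accuracy: it reaches 89.66% top-1 accuracy on CIFAR-10 with one third of the parameters of ResNet110 and 2.8x fewer floating-point operations (FLOPS), and it performs well even when training data is limited.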

ABSTRACT

Ensemble models comprising of deep Convolutional Neural Networks (CNN) have shown significant improvements in model generalization but at the cost of large computation and memory requirements. In this paper, we present a framework for learning compact CNN models with improved classification performance and model generalization. For this, we propose a CNN architecture of a compact student model with parallel branches which are trained using ground truth labels and information from high capacity teacher networks in an ensemble learning fashion. Our framework provides two main benefits: i) Distilling knowledge from different teachers into the student network promotes heterogeneity in feature learning at different branches of the student network and enables the network to learn diverse solutions to the target problem. ii) Coupling the branches of the student network through ensembling encourages collaboration and improves the quality of the final predictions by reducing variance in the network outputs. Experiments on the well established CIFAR-10 and CIFAR-100 datasets show that our Ensemble Knowledge Distillation (EKD) improves classification accuracy and model generalization especially in situations with limited training data. Experiments also show that our EKD based compact networks outperform in terms of mean accuracy on the test datasets compared to state-of-the-art knowledge distillation based methods.

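Motivation and Objectives

Improve the generalization and accuracy of compact CNNs in low-data settings without increasing inference cost.
Address the excessive compute and memory requirements of deep ensemble models in resource-constrained environments.
Enable a compact student network to learn diverse, high-level feature representations by distilling knowledge from multiple heterogeneous teacher networks.
Reduce output variance and improve prediction quality by ensembling the distilled branches within the student network.
Design a training objective that jointly optimizes alignment with ground-truth labels and imitation of features from multiple teachers.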

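Proposed Method

A multi-branch student CNN architecture is proposed in which each branch learns from a different high-capacity teacher network through knowledge distillation.
A novel training objective simultaneously minimizes the cross-entropy loss against ground-truth labels and a distillation loss between teacher and student feature maps.
At inference time, the branch predictions are averaged as an ensemble to reduce output variance and improve robustness.
Training each student branch on the output of a different teacher introduces heterogeneity in feature learning and promotes diverse representations.
Knowledge is distilled using the teachers' soft labels, with temperature scaling to improve feature transfer.
ResNet-based architectures serve as student models (e.g., ResNet8), and deeper ResNets (e.g., ResNet110) serve as teachers, to ensure a fair comparison.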
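
The method bullets above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the weighting factor `alpha`, the temperature value, the `temperature ** 2` scaling of the soft-label term, and the choice to average logits (rather than probabilities) across branches are all assumptions made for illustration. The feature-map distillation term mentioned above is omitted here; only the soft-label term is shown.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ensemble_logits(branch_logits):
    """Average the branch logits class by class (assumed ensembling rule)."""
    n = len(branch_logits)
    return [sum(column) / n for column in zip(*branch_logits)]


def ekd_style_loss(branch_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Hypothetical combined objective for one training example.

    branch_logits[i] is the output of student branch i, distilled from
    teacher_logits[i]. alpha and temperature are illustrative values only.
    """
    if len(branch_logits) != len(teacher_logits):
        raise ValueError("expected one teacher per student branch")
    # Label loss on the ensembled output.
    loss = cross_entropy(ensemble_logits(branch_logits), label)
    for student, teacher in zip(branch_logits, teacher_logits):
        # Label loss on each individual branch.
        loss += cross_entropy(student, label)
        # Soft-label distillation: match the temperature-softened teacher.
        soft_teacher = softmax(teacher, temperature)
        soft_student = softmax(student, temperature)
        loss += alpha * temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    return loss


def predict(branch_logits):
    """Inference: return the class with the highest ensembled logit."""
    averaged = ensemble_logits(branch_logits)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

In this sketch, a branch whose softened output already matches its teacher contributes no distillation penalty, so the loss reduces to the label terms alone. A real training setup would compute the same quantities over mini-batches in a deep learning framework with automatic differentiation.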

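Experimental Results

Research Questions

RQ1: Does knowledge distillation from multiple diverse teacher networks improve the generalization and accuracy of a compact student network?
RQ2: Does ensembling the distilled outputs of the parallel branches in the student network reduce prediction variance and improve final accuracy?
RQ3: How does EKD perform relative to standard knowledge distillation and non-distilled models when training data is limited?
RQ4: Compared with large ensemble models, can EKD achieve state-of-the-art performance with substantially smaller model size and fewer FLOPS?
RQ5: To what extent does distillation from multiple teachers improve the class separation of the learned feature embeddings?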

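Key Findings

The EKD-based 7-branch ResNet8 reaches 89.66% top-1 accuracy on CIFAR-10, outperforming all compared KD methods, including TAKD (88.01%) and MUTUAL (87.71%).
With only 10% of the training data, the EKD-based ResNet8 achieves higher accuracy than ResNet110 with 3x fewer parameters and 2.8x fewer FLOPS.
t-SNE visualizations show that EKD models produce better-separated class embeddings than non-distilled models, with the difference most pronounced in low-data conditions.
Ablation experiments confirm that multi-teacher distillation and branch ensembling each contribute significantly to the performance gain, and that combining the two works best.
The proposed training objective effectively balances label alignment and feature imitation, allowing the student network to learn diverse and discriminative representations from multiple teachers.
The framework significantly improves generalization while keeping inference cost low, making it suitable for edge and mobile applications.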
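
The variance reduction attributed to branch ensembling can be motivated with a standard identity. As a simplifying assumption that does not come from the paper, treat the $N$ branch outputs $f_i(x)$ as random variables with a common variance $\sigma^2$ and a common pairwise correlation $\rho$:

```latex
\mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} f_i(x)\right)
  = \frac{1}{N^2}\left(N\sigma^2 + N(N-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
```

Under this assumption, the second term shrinks as branches are added and the first term shrinks as the branches become less correlated. This is consistent with the finding that multi-teacher distillation, which diversifies the branches, and ensembling work best in combination.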


This review was generated by AI and reviewed by a human editor.