QUICK REVIEW

[Paper Review] Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

Gregor Urban, Krzysztof J. Geras|PolyPublie (École Polytechnique de Montréal)|Mar 17, 2016

Advanced Neural Network Applications | References: 22 | Cited by: 107

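One-Sentence Summary

This paper shows empirically that, on CIFAR-10, shallow models cannot match deep convolutional networks even with distillation and hyperparameter optimization; within the same parameter budget, multiple convolutional layers are needed to reach high accuracy.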

ABSTRACT

Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained. Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.

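Research Motivation and Goals

Investigate whether shallow networks can match deep convolutional neural networks on CIFAR-10 under an equal parameter budget.
Evaluate how well shallow CNNs perform in a distillation (teacher-student) framework combined with Bayesian hyperparameter optimization.
Compare shallow models trained on hard labels with shallow models trained on soft targets from an ensemble of deep teachers.
Quantify how many convolutional layers a shallow model needs to approach the accuracy of a deep model.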

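Proposed Method

Train a state-of-the-art ensemble of deep convolutional teachers (16 CNNs) on CIFAR-10 with extensive data augmentation.
Train shallow student models by distillation, so that they mimic the ensemble's logits (soft targets) rather than one-hot hard labels.
Apply a linear bottleneck to shallow models with 0–1 convolutional layers to speed up learning.
Run Bayesian hyperparameter optimization (Gaussian processes via Spearmint) over learning rate, momentum, weight scale, and network width.
Augment the data with HSV-based shifts and random crops/flips to create a large transfer set for model compression.
Evaluate shallow students against the deep teacher ensemble across architectures (1–4 convolutional layers, different parameter budgets).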
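The logit-mimicking step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the summary only states that students mimic the ensemble's logits, so the squared-error objective and the averaging of teacher logits are assumptions, and the function names `average_ensemble_logits` and `logit_mse` are hypothetical.

```python
def average_ensemble_logits(member_logits):
    """Average per-class logits over several teachers for one example.

    member_logits: list of logit vectors, one per teacher.
    Averaging is an assumption; the summary does not say how the
    ensemble's logits are combined.
    """
    n = len(member_logits)
    return [sum(column) / n for column in zip(*member_logits)]


def logit_mse(student_logits, teacher_logits):
    """Mean squared error between student and teacher logits over a batch.

    Both arguments are lists of logit vectors (one vector per example).
    The squared-error objective is an assumed choice of regression loss.
    """
    total = 0.0
    count = 0
    for s_row, t_row in zip(student_logits, teacher_logits):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy usage: two teachers, two classes, one example.
target = average_ensemble_logits([[1.0, 2.0], [3.0, 4.0]])
loss = logit_mse([[0.0, 3.0]], [target])
```

Regressing on logits rather than one-hot labels gives the student a real-valued target for every class, which is the extra signal that the "soft targets" wording refers to.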

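Experimental Results

Research Questions

RQ1: When trained by distillation, can shallow networks with a comparable number of parameters reach accuracy close to that of deep models on CIFAR-10?
RQ2: Does distillation with soft targets let shallow architectures narrow the gap to deep convolutional networks on CIFAR-10?
RQ3: Under a fixed parameter budget, how many convolutional layers does a shallow model need to reach competitive performance?
RQ4: What role do data augmentation and hyperparameter optimization play in training effective shallow mimic models?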

Main Findings

Model	Architecture	# Params	Accuracy
1 conv. layer	c-mp-lfc-fc	10M	84.6%
2 conv. layer	c-mp-c-mp-fc	10M	88.9%
3 conv. layer	c-mp-c-mp-c-mp-fc	10M	91.2%
4 conv. layer	c-mp-c-c-mp-c-mp-fc	10M	91.75%
Teacher CNN 1st	76 c^2 -mp-126 c^2 -mp-148 c^4 -mp-1200 fc^2	5.3M	92.78%
Teacher CNN 2nd	96 c^2 -mp-171 c^2 -mp-128 c^4 -mp-512 fc^2	2.5M	92.77%
Teacher CNN 3rd	54 c^2 -mp-158 c^2 -mp-189 c^4 -mp-1044 fc^2	5.8M	92.67%
Ensemble of 16 CNNs	c^2 -mp- c^2 -mp- c^4 -mp- fc^2	83.4M	93.8%
Teacher CNN (*)	128c-mp-128c-mp-128c-mp-1k fc	2.1M	88.0%
Ensemble, 4 CNNs (*)	128c-mp-128c-mp-128c-mp-1k fc	8.6M	89.0%
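As a quick worked example of how to read the table, the sketch below computes each 10M-parameter student's accuracy gap to the 16-CNN ensemble. The numbers are copied from the rows above; the helper name `gap_to_ensemble` is hypothetical.

```python
# Accuracies (%) taken from the table above.
ENSEMBLE_ACC = 93.8
STUDENT_ACC = {1: 84.6, 2: 88.9, 3: 91.2, 4: 91.75}  # conv layers -> accuracy


def gap_to_ensemble(accuracy, reference=ENSEMBLE_ACC):
    """Return the accuracy gap in percentage points, rounded to 2 decimals."""
    return round(reference - accuracy, 2)


gaps = {layers: gap_to_ensemble(acc) for layers, acc in STUDENT_ACC.items()}

for layers, gap in gaps.items():
    print(f"{layers} conv layer(s): {gap} points below the ensemble")
```

The gap shrinks from 9.2 points with one convolutional layer to 2.05 points with four, at the same 10M-parameter budget.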

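Within the same parameter budget, shallow models cannot match deep convolutional networks, even with distillation.
The ensemble of 16 deep CNNs reaches 93.8% accuracy on the final test set (94.0% on the validation set).
Shallow student CNNs need multiple convolutional layers (3–4) to reach high CIFAR-10 accuracy; with 1–2 convolutional layers they lag behind the deep models.
Distillation markedly improves shallow models over hard-label training, especially for very shallow architectures (e.g., one convolutional layer).
Shallow fully connected multilayer perceptrons with no convolution perform markedly worse (about 70% under distillation, versus over 90% for convolutional networks).
Even with hyperparameter optimization and distillation, a clear "convolutional gap" remains; adding convolutional layers narrows the gap for shallow students but does not fully close it.
The best single-layer MLP reaches 70.2% accuracy, which marks the limit of non-convolutional shallow models on CIFAR-10 even with distillation.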


This review was generated by AI and reviewed by human editors.