QUICK REVIEW

[Paper Review] NORM: Knowledge Distillation via N-to-One Representation Matching

Xiaolong Liu, Lujun Li|arXiv (Cornell University)|May 23, 2023

Advanced Neural Network Applications|Cited by 19

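One-Sentence Summary

NORM introduces a many-to-one representation matching mechanism by inserting a lightweight, linear feature transform after the last convolutional layer of the student network. The transform expands the student features to N times the teacher's channel count, splits them into N segments, and matches all segments against the teacher representation jointly, which yields more transfer paths without adding any overhead at inference.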

ABSTRACT

Existing feature distillation methods commonly adopt the One-to-one Representation Matching between any pre-selected teacher-student layer pair. In this paper, we present N-to-One Representation (NORM), a new two-stage knowledge distillation method, which relies on a simple Feature Transform (FT) module consisting of two linear layers. In view of preserving the intact information learnt by the teacher network, during training, our FT module is merely inserted after the last convolutional layer of the student network. The first linear layer projects the student representation to a feature space having N times feature channels than the teacher representation from the last convolutional layer, and the second linear layer contracts the expanded output back to the original feature space. By sequentially splitting the expanded student representation into N non-overlapping feature segments having the same number of feature channels as the teacher's, they can be readily forced to approximate the intact teacher representation simultaneously, formulating a novel many-to-one representation matching mechanism conditioned on a single teacher-student layer pair. After training, such an FT module will be naturally merged into the subsequent fully connected layer thanks to its linear property, introducing no extra parameters or architectural modifications to the student network at inference. Extensive experiments on different visual recognition benchmarks demonstrate the leading performance of our method. For instance, the ResNet18|MobileNet|ResNet50-1/4 model trained by NORM reaches 72.14%|74.26%|68.03% top-1 accuracy on the ImageNet dataset when using a pre-trained ResNet34|ResNet50|ResNet50 model as the teacher, achieving an absolute improvement of 2.01%|4.63%|3.03% against the individually trained counterpart. Code is available at https://github.com/OSVAI/NORM

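Motivation and Objectives

Improve two-stage knowledge distillation by moving beyond One-to-One Representation Matching (ORM).
Preserve the teacher's information by inserting a minimal, absorbable feature transform after the student's last convolutional layer.
Enable many-to-one knowledge transfer through feature expansion and splitting, adding transfer paths without adding inference-time parameters.
Demonstrate state-of-the-art KD performance on CIFAR-100 and ImageNet, along with compatibility with logits-based KD and contrastive KD.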

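Proposed Method

Insert a Feature Transform (FT) module made of two linear layers after the last convolutional layer of the student network.
The first FT layer is a 1x1 convolution that expands the channels to N×C_t, producing F_se; the second FT layer is another 1x1 convolution that contracts the channels back to C_s, producing F_sc.
Split F_se into N non-overlapping segments F_se^i, each with C_t channels, giving N parallel distillation paths toward F_t; minimize L_norm = (1/N) Σ_i ||F_se^i − F_t||_2^2.
Keep the FT linear, with no activation functions; at inference, merge it into the subsequent fully connected layer via W_fc ← W_fc (W_sc W_se + I), where the identity term I comes from the linear residual connection around the FT.
Total training loss: L_total = L_ce + α L_norm (α defaults to 10 on CIFAR-100 and 8 on ImageNet); L_kd and/or L_crd can optionally be added for further gains.
Ablations cover the (linear) residual connection used to stabilize training; N is typically set to 8; the FT module is placed after the last convolutional layer to minimize architectural changes.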
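
The expand, split, and match steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation (their code is at the GitHub link in the abstract). The helper names `conv1x1`, `norm_forward`, and `norm_loss` are hypothetical, the 1x1 convolutions are assumed to be bias-free, teacher and student features are assumed to share the same spatial size, and averaging the squared error over the batch is an assumption about normalization.

```python
import numpy as np

def conv1x1(x, w):
    """Bias-free 1x1 convolution: x is (B, C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", w, x)

def norm_forward(f_s, w_se, w_sc):
    """Return (F_se, F_sc) for the FT module with a linear residual connection."""
    f_se = conv1x1(f_s, w_se)          # expand: C_s -> N * C_t channels
    f_sc = conv1x1(f_se, w_sc) + f_s   # contract: N * C_t -> C_s, plus identity
    return f_se, f_sc

def norm_loss(f_se, f_t, n):
    """L_norm = (1/N) * sum_i ||F_se^i - F_t||_2^2, averaged over the batch (assumption)."""
    batch = f_t.shape[0]
    # N sequential, non-overlapping segments, each with C_t channels.
    segments = np.split(f_se, n, axis=1)
    per_segment = [np.sum((seg - f_t) ** 2) / batch for seg in segments]
    return sum(per_segment) / n

# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
B, C_s, C_t, H, W, N = 2, 4, 6, 3, 3, 8
f_s = rng.standard_normal((B, C_s, H, W))        # student feature F_s
f_t = rng.standard_normal((B, C_t, H, W))        # teacher feature F_t
w_se = rng.standard_normal((N * C_t, C_s)) * 0.1 # expansion weights W_se
w_sc = rng.standard_normal((C_s, N * C_t)) * 0.1 # contraction weights W_sc

f_se, f_sc = norm_forward(f_s, w_se, w_sc)
loss = norm_loss(f_se, f_t, N)
```

Here `f_se` carries N×C_t channels and feeds only the distillation loss, while `f_sc` has the student's original C_s channels and continues on to the classifier. In training, `loss` would be scaled by α and added to the cross-entropy term.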

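Experimental Results

Research Questions

RQ1: In two-stage KD, can many-to-one representation matching outperform conventional one-to-one feature distillation?
RQ2: How do the expansion factor N and the weight α on L_norm affect performance and training stability?
RQ3: Does NORM generalize across teacher-student pairs of the same and of different architecture types, and how does it interact with logits-based KD and contrastive KD?
RQ4: What is NORM's impact at inference time, and can the FT be absorbed into the classifier without adding parameters?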
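
The absorption asked about in RQ4 follows from linearity, and it can be checked numerically. The sketch below assumes a student head of global average pooling followed by a fully connected layer, bias-free 1x1 convolutions in the FT, and a linear residual connection around the FT. All variable names and toy shapes are illustrative and not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C_s, C_t, H, W, N, K = 2, 4, 6, 3, 3, 8, 10   # K = number of classes (toy value)

f_s = rng.standard_normal((B, C_s, H, W))         # student feature after the last conv
w_se = rng.standard_normal((N * C_t, C_s)) * 0.1  # FT expansion weights
w_sc = rng.standard_normal((C_s, N * C_t)) * 0.1  # FT contraction weights
w_fc = rng.standard_normal((K, C_s))              # classifier weights
b_fc = rng.standard_normal(K)                     # classifier bias

# Training-time path: FT (with residual) -> global average pooling -> FC.
f_se = np.einsum("oc,bchw->bohw", w_se, f_s)
f_sc = np.einsum("oc,bchw->bohw", w_sc, f_se) + f_s
logits_train = f_sc.mean(axis=(2, 3)) @ w_fc.T + b_fc

# Inference-time path: FT folded into the classifier, W_fc <- W_fc (W_sc W_se + I).
w_fc_merged = w_fc @ (w_sc @ w_se + np.eye(C_s))
logits_infer = f_s.mean(axis=(2, 3)) @ w_fc_merged.T + b_fc

max_abs_diff = np.max(np.abs(logits_train - logits_infer))
```

Because pooling and both 1x1 convolutions are linear, the two paths agree up to floating-point error, and `w_fc_merged` has the same shape as `w_fc`, so the deployed student keeps its original parameter count.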

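Key Findings

On CIFAR-100 with same-type teacher-student pairs, NORM improves top-1 accuracy over the baseline by 2.88% on average.
On CIFAR-100 with different-type teacher-student pairs, NORM improves top-1 accuracy by 5.81% on average, with a maximum gain of 6.92%.
On ImageNet, ResNet18 with a ResNet34 teacher reaches 72.14% top-1 (up from 70.13%), an absolute gain of 2.01%.
On ImageNet, MobileNet with a ResNet50 teacher reaches 74.26% top-1 (up from 69.63%), an absolute gain of 4.63%.
On ImageNet, NORM generally achieves the best or competitive results relative to mainstream KD methods, and it improves further when combined with vanilla KD or contrastive KD (e.g. NORM+KD and NORM+CRD).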

This review was generated by AI and checked by human editors.