QUICK REVIEW

[Paper Review] Zero-Shot Knowledge Distillation in Deep Networks

Gaurav Kumar Nayak, Konda Reddy Mopuri | Indian Institute of Science Bangalore | May 20, 2019

Machine Learning and Data Classification | Cited by 85

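One-Sentence Summary

The paper proposes a data-free knowledge distillation framework that models the teacher's softmax space with Dirichlet distributions and synthesizes Data Impressions from the teacher model, enabling KD without any training data and achieving competitive performance on MNIST, Fashion-MNIST, and CIFAR-10.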

ABSTRACT

Knowledge distillation deals with the problem of training a smaller model (Student) from a high capacity source model (Teacher) so as to retain most of its performance. Existing approaches use either the training data or meta-data extracted from it in order to train the Student. However, accessing the dataset on which the Teacher has been trained may not always be feasible if the dataset is very large or it poses privacy or safety concerns (e.g., bio-metric or medical data). Hence, in this paper, we propose a novel data-free method to train the Student from the Teacher. Without even using any meta-data, we synthesize the Data Impressions from the complex Teacher model and utilize these as surrogates for the original training data samples to transfer its learning to Student via knowledge distillation. We, therefore, dub our method "Zero-Shot Knowledge Distillation" and demonstrate that our framework results in competitive generalization performance as achieved by distillation using the actual training data samples on multiple benchmark datasets.

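Motivation and Objectives

Motivate and address the data-access and privacy challenges of knowledge distillation when the training data is unavailable or restricted.
Propose a data-free KD pipeline that synthesizes pseudo-samples (Data Impressions) from the teacher by modeling the softmax space with Dirichlet distributions.
Derive a class-similarity prior from the teacher to guide data synthesis and improve transfer.
Demonstrate the effectiveness of ZSKD on multiple datasets and compare it against data-dependent baselines and meta-data-based methods.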

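Proposed Method

Model the teacher's softmax outputs with one Dirichlet distribution per class to capture class similarity.
Compute a class-similarity matrix from the weights connecting the teacher's final layer and the layer before it, and use it to form the Dirichlet concentration parameters.
For each class k, sample softmax vectors from Dir(K, alpha^k) and construct the corresponding Data Impressions by optimizing an input to minimize the cross-entropy between the teacher's output and the sampled softmax vector.
Generate the transfer set (the Data Impressions) and perform knowledge distillation using only the KD loss between teacher and student.
Use a scaling factor beta to control the Dirichlet concentration and thereby the diversity of the Data Impressions.
Optionally augment the Data Impressions during distillation to improve performance.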
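The first three steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the "teacher" is a single hypothetical linear softmax layer (W, b) so that the input gradient can be written by hand, the similarity measure is assumed to be cosine similarity between class weight vectors, and the small eps offset that keeps every concentration parameter positive is also an assumption. All function names are illustrative.

```python
import numpy as np

def class_similarity(W):
    """Cosine similarity between class weight vectors (rows of W), shape (K, K)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T

def concentration(C, beta=1.0, eps=1e-2):
    """Row-wise min-max normalisation, scaled by beta.
    eps is an assumption: it keeps every Dirichlet parameter strictly positive."""
    lo = C.min(axis=1, keepdims=True)
    hi = C.max(axis=1, keepdims=True)
    return beta * ((C - lo) / (hi - lo) + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, p):
    return float(-(y * np.log(p + 1e-12)).sum())

def data_impression(W, b, y, x0, steps=2000, lr=0.02):
    """Gradient descent on the input so that softmax(W x + b) approaches y."""
    x = x0.copy()
    for _ in range(steps):
        p = softmax(W @ x + b)
        x -= lr * (W.T @ (p - y))  # gradient of cross_entropy(y, p) w.r.t. x
    return x

rng = np.random.default_rng(0)
K, D = 4, 16                    # hypothetical sizes: 4 classes, 16 input features
W = rng.normal(size=(K, D))     # stand-in for the teacher's final-layer weights
b = np.zeros(K)

alpha = concentration(class_similarity(W), beta=1.0)  # row k parameterises class k
k = 2
y = rng.dirichlet(alpha[k])     # sampled softmax target for class k
x0 = rng.normal(scale=0.1, size=D)
x = data_impression(W, b, y, x0)
p = softmax(W @ x + b)
print("cross-entropy before:", cross_entropy(y, softmax(W @ x0 + b)))
print("cross-entropy after: ", cross_entropy(y, p))
```

With a real deep teacher, the hand-written gradient would be replaced by automatic differentiation through the frozen network. The resulting inputs x, collected over many sampled targets and all classes, would form the transfer set used for distillation.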

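Experimental Results

Research Questions

RQ1: Can knowledge distillation be carried out effectively without access to any training data or data-derived meta-data?
RQ2: Can pseudo-samples (Data Impressions) synthesized from the teacher's softmax space serve as a viable transfer set for training the student?
RQ3: How effective is Dirichlet-based modeling of the softmax space at capturing inter-class similarity to guide data synthesis?
RQ4: How does zero-shot KD compare with data-dependent KD and data-free baselines on standard benchmarks?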

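Key Findings

ZSKD achieves generalization competitive with data-driven KD on MNIST, Fashion-MNIST, and CIFAR-10 without using the original training data.
On MNIST, Fashion-MNIST, and CIFAR-10, ZSKD with Data Impressions outperforms earlier data-free and few-data methods and, in several settings, approaches the performance of KD with the full data.
Dirichlet-based softmax modeling, guided by the learned class-similarity matrix, produces diverse yet relevant pseudo-samples that transfer knowledge to the student effectively.
Increasing the transfer-set size (the number of Data Impressions) generally improves performance, but with diminishing returns as the set grows; simpler datasets need fewer impressions to reach competitive results.
Data Impressions often look visually different from real data yet still induce meaningful KD, and sometimes capture recognizable object patterns.
Mixing Data Impressions generated with two Dirichlet scaling factors (beta = 0.1 and 1.0) improved diversity and performance in practice.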
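To see why mixing two scaling factors can add diversity, it helps to look at how beta changes the sampled softmax vectors. The sketch below relies only on a general property of the Dirichlet distribution: shrinking all concentration parameters by a common factor keeps the mean fixed but pushes samples toward the corners of the simplex. The base vector is a made-up example of normalised similarities for one class, not a value from the paper, and the entropy statistic is just one convenient way to measure how peaked the samples are.

```python
import numpy as np

rng = np.random.default_rng(0)
base = np.array([1.0, 0.6, 0.3, 0.2])  # hypothetical normalised similarities for one class

def mean_entropy(samples):
    """Average Shannon entropy of the sampled probability vectors."""
    return float(-(samples * np.log(samples + 1e-12)).sum(axis=1).mean())

n = 5000
peaked = rng.dirichlet(0.1 * base, size=n)  # beta = 0.1: samples near one-hot
smooth = rng.dirichlet(1.0 * base, size=n)  # beta = 1.0: softer, more spread-out targets

# A mixed pool of targets, half from each beta.
mixed = np.vstack([peaked[: n // 2], smooth[: n // 2]])

print("mean entropy, beta=0.1:", mean_entropy(peaked))
print("mean entropy, beta=1.0:", mean_entropy(smooth))
print("mean entropy, mixed:   ", mean_entropy(mixed))
```

The mixed pool contains both confident, near one-hot targets and softer targets that spread probability across similar classes, which is one plausible reading of the diversity gain reported above.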


This review was generated by AI and reviewed by human editors.