QUICK REVIEW

[Paper Review] Contrastive Representation Distillation

Yonglong Tian, Dilip Krishnan|arXiv (Cornell University)|Oct 23, 2019

Domain Adaptation and Few-Shot Learning | References: 36 | Cited by: 64

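ONE-SENTENCE SUMMARY

CRD uses a contrastive objective to transfer knowledge between teacher and student representations, and it outperforms standard knowledge distillation on model compression, cross-modal transfer, and ensemble distillation.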

ABSTRACT

Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation. Code: http://github.com/HobbitLong/RepDistiller.

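MOTIVATION AND GOALS

Advocate transferring representational knowledge rather than output probabilities alone.
Address a limitation of KL-based KD: it treats the output dimensions independently.
Propose a contrastive objective that captures correlations and higher-order dependencies in the representation.
Demonstrate CRD's effectiveness in model compression, cross-modal transfer, and ensemble distillation.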
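For reference, the KL-based KD objective that CRD is compared against can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names, the default temperature of 4.0, and the temperature-squared scaling (a common convention) are assumptions of this sketch. Note that the loss is a plain sum of per-class terms.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened outputs.

    The temperature default and the T^2 factor are assumptions of this
    sketch, not values taken from the summary above.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    # Sum over classes: each class contributes its own independent term.
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two distributions diverge.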

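PROPOSED METHOD

Define the teacher and student representations at the penultimate layer.
Build a contrastive loss that pulls together teacher-student pairs computed from the same input x and pushes apart pairs computed from different inputs.
Use a critic h to estimate P(C=1|T,S), where C=1 marks a matched pair; this gives a lower bound on the mutual information between the teacher and student representations, which is tightened by maximizing a related log-likelihood objective.
Derive a practical InfoNCE-like objective and use a memory bank of negative samples to stabilize training.
Add a KD term or the cross-modal/ensemble extensions as needed, giving the CRD and CRD+KD variants.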
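The steps above can be sketched for a single student embedding as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the critic is taken to be h = e^z / (e^z + c), with z the temperature-scaled cosine similarity and c a constant; the names `contrastive_distill_loss`, `tau`, and `prior_ratio` are hypothetical; and the negatives are passed in as a plain list where a real system would draw them from the memory bank.

```python
import math

def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _log_critic(teacher_vec, student_vec, tau, prior_ratio):
    """Return (log h, log(1 - h)) for h = e^z / (e^z + prior_ratio)."""
    t, s = _normalize(teacher_vec), _normalize(student_vec)
    z = sum(a * b for a, b in zip(t, s)) / tau
    log_c = math.log(prior_ratio)
    # log(e^z + c), computed stably via the log-sum-exp trick.
    hi, lo = max(z, log_c), min(z, log_c)
    log_denom = hi + math.log1p(math.exp(lo - hi))
    return z - log_denom, log_c - log_denom

def contrastive_distill_loss(student_vec, teacher_pos, teacher_negs,
                             tau=0.1, prior_ratio=1.0):
    """Negative log-likelihood of the critic for one student embedding.

    teacher_pos comes from the same input as student_vec (C=1);
    teacher_negs come from other inputs (C=0). tau and prior_ratio are
    hypothetical hyperparameters of this sketch.
    """
    log_h_pos, _ = _log_critic(teacher_pos, student_vec, tau, prior_ratio)
    loss = -log_h_pos                      # pull the matched pair together
    for neg in teacher_negs:
        _, log_not_h = _log_critic(neg, student_vec, tau, prior_ratio)
        loss -= log_not_h                  # push mismatched pairs apart
    return loss
```

A student embedding aligned with its own teacher embedding and far from the negatives yields a small loss; one that sits close to a negative yields a large loss. A CRD+KD variant would add a weighted KD term to this quantity.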

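EXPERIMENTAL RESULTS

Research questions

RQ1: Does a contrastive representation objective improve teacher-to-student knowledge transfer beyond conventional KD?
RQ2: How does CRD compare with KD and other distillation methods on model compression, cross-modal transfer, and ensemble distillation?
RQ3: What roles do negative sampling and the mutual information bound play in guiding representation transfer?

Key findings

CRD consistently outperforms KD across multiple teacher-student pairs on CIFAR-100 and ImageNet; on CIFAR-100 (Table 1), the average relative improvement over KD is 57%.
CRD also outperforms KD and other methods in cross-architecture transfer (teacher and student of different architecture types), as shown in Table 2.
In some settings, CRD can be combined with KD (CRD+KD) for further gains.
CRD achieves state-of-the-art results in several configurations of model compression, cross-modal transfer, and ensemble distillation.
The method emphasizes transferring information from the teacher's representation rather than only conditional class probabilities; when combined with KD, the student sometimes even outperforms the teacher.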
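An "average relative improvement" figure like the one quoted for Table 1 could be computed as sketched below. The definition used here is an assumption, since this summary does not state one: the gain of the new method over KD, divided by KD's gain over a student trained without distillation, averaged over teacher-student pairs. The function name is hypothetical, and the accuracies in the usage lines are made-up illustrative values, not results from the paper.

```python
def average_relative_improvement(acc_new, acc_kd, acc_vanilla):
    """Mean of (new - KD) / (KD - vanilla) over teacher-student pairs.

    All three arguments are equal-length lists of accuracies in percent.
    This definition is an assumption of the sketch.
    """
    ratios = [
        (new - kd) / (kd - vanilla)
        for new, kd, vanilla in zip(acc_new, acc_kd, acc_vanilla)
    ]
    return sum(ratios) / len(ratios)

# Hypothetical accuracies for two teacher-student pairs (illustrative only).
vanilla = [70.0, 72.0]   # student trained without distillation
kd = [72.0, 73.0]        # student trained with KD
new = [73.0, 73.5]       # student trained with the new method
improvement = average_relative_improvement(new, kd, vanilla)
```

Under this definition, a value of 0.57 would mean the new method adds, on average, a further 57% of the gain that KD itself delivers over the undistilled student.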


This review was generated by AI and checked by a human editor.