QUICK REVIEW

[Paper Review] M2KD: Multi-model and Multi-level Knowledge Distillation for Incremental Learning

Peng Zhou, Long Mai | arXiv (Cornell University) | Apr 3, 2019

Domain Adaptation and Few-Shot Learning | References: 37 | Cited by: 45

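One-Sentence Summary

The paper proposes M2KD, which combines multi-model and multi-level knowledge distillation with pruning-based model reconstruction to mitigate forgetting in both exemplar-free and exemplar-based incremental learning.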

ABSTRACT

Incremental learning targets at achieving good performance on new categories without forgetting old ones. Knowledge distillation has been shown critical in preserving the performance on old classes. Conventional methods, however, sequentially distill knowledge only from the last model, leading to performance degradation on the old classes in later incremental learning steps. In this paper, we propose a multi-model and multi-level knowledge distillation strategy. Instead of sequentially distilling knowledge only from the last model, we directly leverage all previous model snapshots. In addition, we incorporate an auxiliary distillation to further preserve knowledge encoded at the intermediate feature levels. To make the model more memory efficient, we adapt mask based pruning to reconstruct all previous models with a small memory footprint. Experiments on standard incremental learning benchmarks show that our method preserves the knowledge on old classes better and improves the overall performance over standard distillation techniques.

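Motivation and Objectives

Address catastrophic forgetting in incremental learning when the full dataset is not available
Preserve old knowledge by distilling from all previous model snapshots rather than only the most recent one
Strengthen knowledge retention by using intermediate features through auxiliary distillation
Improve memory efficiency by reconstructing past models with mask-based pruning
Demonstrate state-of-the-art performance in the exemplar-free setting and strong results in the exemplar-based setting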

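Proposed Method

Introduces a multi-model distillation loss that keeps the current model's outputs consistent with the outputs of all previous model snapshots
Adds an auxiliary distillation loss that preserves intermediate feature representations
Uses mask-based pruning to reconstruct past models while storing only their critical parameters, which enables online model reconstruction
Combines multi-model distillation and auxiliary distillation into a total loss, L_total = L_MMD + lambda * L_AD
Trains on the current data with standard cross-entropy while distilling from the past models
Backbone-agnostic framework, compatible with both exemplar-free and exemplar-based incremental learning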
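The combined loss above can be illustrated with a minimal pure-Python sketch. Everything named here is hypothetical and chosen for illustration: the function names, the temperature value, the convention that each snapshot supervises only the class indices it knows, and the use of mean squared error as a stand-in for the auxiliary term. The paper's exact formulation may differ, and the cross-entropy term on new classes is omitted for brevity.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def feature_mse(teacher_feat, student_feat):
    """Assumed stand-in for auxiliary distillation: MSE on intermediate features."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_feat, student_feat)]
    return sum(diffs) / len(diffs)

def total_loss(teachers, student_logits, teacher_feats, student_feats,
               lam=1.0, temperature=2.0):
    """L_total = L_MMD + lam * L_AD.

    teachers: list of (class_indices, teacher_logits) pairs, one per snapshot.
    Assumption: each snapshot is matched only on the classes it lists.
    """
    l_mmd = sum(
        distill_loss(t_logits, [student_logits[c] for c in classes], temperature)
        for classes, t_logits in teachers
    )
    l_ad = sum(feature_mse(tf, sf) for tf, sf in zip(teacher_feats, student_feats))
    return l_mmd + lam * l_ad

# Two previous snapshots, each covering two classes (toy values).
teachers = [([0, 1], [2.0, 0.0]), ([2, 3], [0.0, 1.0])]
feats = [[0.5, -0.5]]
agrees = total_loss(teachers, [2.0, 0.0, 0.0, 1.0], feats, feats)
drifts = total_loss(teachers, [0.0, 2.0, 1.0, 0.0], feats, feats)
print(agrees < drifts)
```

In this sketch a student whose logits match every snapshot incurs a lower loss than one that has drifted, which is the behavior that distilling from all snapshots, rather than only the latest, is meant to encourage.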

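Experimental Results

Research Questions

RQ1: Does distilling from all previous model snapshots preserve old knowledge better than sequential distillation from only the most recent model?
RQ2: Does auxiliary distillation on intermediate features further reduce forgetting beyond distillation on the final logits?
RQ3: Can mask-based pruning reconstruct past models effectively at low memory overhead without sacrificing performance?
RQ4: Is M2KD competitive with or better than state-of-the-art exemplar-free and exemplar-based incremental learning methods?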

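Key Findings

M2KD achieves state-of-the-art performance in exemplar-free incremental learning on CIFAR-100 and iILSVRC-small.
Pruning-based reconstruction makes distillation from past models memory efficient while matching the accuracy of the unpruned variant.
Auxiliary distillation improves retention by preserving intermediate feature statistics in addition to the final logits.
In the exemplar-based setting, combining M2KD with exemplar data further improves accuracy over baseline exemplar-based methods.
The method scales across step sizes (5, 10, and 20 classes per step) and remains robust across pruning ratios.
Compared with exemplar-based methods such as iCaRL, the memory cost is significantly lower while accuracy remains competitive.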
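To make the memory argument concrete, the following is a small sketch of storing a snapshot as a binary mask plus the surviving weights and rebuilding it on demand. The magnitude-based selection rule, the `keep_ratio` parameter, and both function names are assumptions for illustration; the summary does not specify the paper's pruning criterion.

```python
def prune_to_mask(weights, keep_ratio):
    """Keep the largest-magnitude weights (assumed criterion).

    Returns a binary mask and the kept values in their original order.
    """
    k = max(1, round(len(weights) * keep_ratio))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    keep = set(by_magnitude[:k])
    mask = [1 if i in keep else 0 for i in range(len(weights))]
    kept_values = [w for w, m in zip(weights, mask) if m]
    return mask, kept_values

def reconstruct(mask, kept_values):
    """Rebuild a dense weight vector; pruned positions become 0.0."""
    values = iter(kept_values)
    return [next(values) if m else 0.0 for m in mask]

# Toy snapshot: store half of the weights, then rebuild it for distillation.
snapshot = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
mask, kept = prune_to_mask(snapshot, keep_ratio=0.5)
restored = reconstruct(mask, kept)
print(mask)      # [1, 0, 1, 0, 1, 0]
print(restored)  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

Only `mask` and `kept` need to be stored per snapshot, so the footprint grows with the number of kept weights rather than with the full model size.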


This review was generated by AI and checked by human editors.