QUICK REVIEW

[Paper Digest] The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models

Tianlong Chen, Jonathan Frankle|arXiv (Cornell University)|Dec 12, 2020

Advanced Neural Network Applications | References: 87 | Cited by: 31

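One-Sentence Summary

This paper asks whether pre-trained computer vision models (supervised and self-supervised) contain matching subnetworks that transfer to different downstream tasks without losing performance. Across classification, detection, and segmentation, the authors find universally transferable lottery-ticket subnetworks at fairly high sparsity.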

ABSTRACT

The computer vision world has been re-gaining enthusiasm in various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as simCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation. Latest studies suggest that pre-training benefits from gigantic model capacity. We are hereby curious and ask: after pre-training, does a pre-trained model indeed have to stay large for its downstream transferability? In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch yet still reach the full models' performance. We extend the scope of LTH and question whether matching subnetworks still exist in pre-trained computer vision models, that enjoy the same downstream transfer performance. Our extensive experiments convey an overall positive message: from all pre-trained weights obtained by ImageNet classification, simCLR, and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, whose performance see no degradation compared to using full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, but more delicate discussions are needed in some cases. Codes and pre-trained models will be made available at: https://github.com/VITA-Group/CV_LTH_Pre-training.

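Motivation and Objectives

Assess whether pre-trained CV models contain matching subnetworks that preserve downstream transfer performance.
Determine whether universal subnetworks exist that transfer across diverse downstream tasks (classification, detection, segmentation).
Compare subnetworks from supervised and self-supervised pre-training in terms of transferability and structural sensitivity.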

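Proposed Method

Treat the pre-trained weights as the initialization of the subnetworks.
Apply iterative magnitude pruning (IMP) to identify matching subnetworks.
Define a matching subnetwork as one whose transfer performance, under the same training setup, is at least on par with that of the full pre-trained model.
Evaluate subnetwork transferability across multiple downstream tasks and datasets (classification, detection, segmentation).
Analyze mask diversity and perturbation sensitivity across pre-training types (ImageNet, simCLR, MoCo).
Examine how larger pre-trained models and temperature settings affect transferability.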
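
The IMP procedure listed above can be sketched roughly as follows. This is a minimal, framework-free illustration and not the authors' implementation: the names `prune_smallest`, `imp`, and `toy_train` are hypothetical, the weights are a flat list rather than a real network, pruning is done globally over all weights, and the 20% per-round pruning rate is an assumption that the digest does not state. Each round restarts from the masked pre-trained weights, trains, and then removes the smallest-magnitude survivors.

```python
import random
from typing import Callable, List

Weights = List[float]
Mask = List[int]


def prune_smallest(weights: Weights, mask: Mask, rate: float) -> Mask:
    """Drop the `rate` fraction of surviving weights with the smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * rate)
    # Rank surviving indices by |weight| and remove the smallest ones.
    to_drop = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_drop:
        new_mask[i] = 0
    return new_mask


def imp(pretrained: Weights,
        train: Callable[[Weights, Mask], Weights],
        rounds: int,
        rate: float = 0.2) -> Mask:  # rate=0.2 is an assumed default
    """Iterative magnitude pruning that restarts from the pre-trained weights."""
    mask = [1] * len(pretrained)
    for _ in range(rounds):
        # Every round starts from the pre-trained weights with the current mask applied.
        init = [w * m for w, m in zip(pretrained, mask)]
        trained = train(init, mask)
        mask = prune_smallest(trained, mask, rate)
    return mask


def toy_train(weights: Weights, mask: Mask) -> Weights:
    """Hypothetical stand-in for real training; keeps pruned weights at zero."""
    return [w * 1.1 * m for w, m in zip(weights, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    theta_p = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    mask = imp(theta_p, toy_train, rounds=5)
    sparsity = 1 - sum(mask) / len(mask)
    print(f"sparsity after 5 rounds: {sparsity:.2%}")
```

In a real experiment `train` would fine-tune the masked network on the pre-training or downstream task, and the resulting subnetwork would count as "matching" only if its transfer performance is at least on par with the full pre-trained model.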

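Experimental Results

Research Questions

RQ1: Are winning tickets found on the pre-training task also winning tickets for downstream tasks?
RQ2: Starting from initializations produced by different pre-training schemes, do universal subnetworks exist that transfer across diverse downstream tasks?
RQ3: How do subnetworks from supervised and self-supervised pre-training differ in transferability and mask structure?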

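Key Findings

Winning tickets exist at 67.23%, 59.04%, and 95.60% sparsity under supervised ImageNet, simCLR, and MoCo pre-training, respectively.
Subnetworks from pre-training transfer universally across diverse downstream classification tasks, at roughly 86.58%–91.41% sparsity on CIFAR-10, CIFAR-100, SVHN, and Fashion-MNIST; VisDA2017 needs more capacity (roughly 59.04%–67.23% sparsity).
On downstream tasks such as detection and segmentation, subnetworks transferred from pre-training can outperform subnetworks found directly on the downstream task (for example, at 95.60%/93.13%/97.75% sparsity for detection and segmentation).
Among the pre-training types, MoCo subnetworks transfer best to detection/segmentation, while ImageNet and simCLR show different strengths depending on the downstream task and sparsity level.
Subnetworks identified from pre-training show diverse mask structures and perturbation sensitivities: after five rounds of IMP, masks from different pre-training types overlap by less than 6.55%.
For self-supervised pre-training (simCLR), pruning a larger pre-trained model yields better transferable subnetworks (as shown by the CIFAR-100 results comparing ResNet-50 and ResNet-152).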
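
The sparsity figures quoted in these findings are consistent with an IMP schedule that removes 20% of the remaining weights in every round, so that sparsity after k rounds is 1 - 0.8^k. The 20% rate is an assumption inferred from the numbers; the digest itself does not state it. The hypothetical helper `sparsity_after` below reproduces the arithmetic: rounds 4, 5, 9, 11, 12, 14, 15, and 17 give 59.04%, 67.23%, 86.58%, 91.41%, 93.13%, 95.60%, 96.48%, and 97.75%.

```python
def sparsity_after(rounds: int, rate: float = 0.2) -> float:
    """Fraction of weights removed after `rounds` rounds of pruning.

    Each round removes `rate` of the weights that are still alive.
    rate=0.2 is an assumption inferred from the reported sparsity levels.
    """
    return 1.0 - (1.0 - rate) ** rounds


if __name__ == "__main__":
    for k in (4, 5, 9, 11, 12, 14, 15, 17):
        print(f"round {k:2d}: {sparsity_after(k):.2%}")
```

Under this reading, the "five rounds of IMP" mentioned in the mask-overlap finding corresponds to 67.23% sparsity.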

This digest was generated by AI and reviewed by a human editor.