QUICK REVIEW

[Paper Review] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Yixuan Wei, Han Hu|arXiv (Cornell University)|May 27, 2022

Advanced Neural Network Applications | Cited by 54

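One-Sentence Summary

The paper introduces feature distillation as a post-processing step that converts pre-trained representations into optimization-friendly features, narrowing the fine-tuning gap between contrastive/self-supervised methods and masked image modeling (MIM). This yields strong fine-tuning gains on a range of models, including CLIP and SwinV2-G, and the paper analyzes the optimization properties that drive the improvement.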

ABSTRACT

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.

Research Motivation and Objectives

Explain and quantify why masked image modeling (MIM) outperforms other pre-training paradigms in fine-tuning.
Propose a generic feature distillation (FD) method that can be applied to arbitrary pre-trained models to improve fine-tuning performance.
Identify and analyze the optimization-friendly properties of representations introduced by FD.
Demonstrate that FD brings non-MIM methods (including contrastive and CLIP-based) to competitive or superior fine-tuning performance.
Show practical gains across ImageNet-1K classification, ADE20K segmentation, and COCO detection.

Proposed Method

Distill feature maps from a pre-trained teacher into a student network using a 1x1 convolution to align dimensions.
Whiten the teacher feature maps to normalize magnitudes and improve distillation stability.
Use a smooth L1 loss between the transformed student features and whitened teacher features for distillation.
Employ shared relative position bias (RPB) across layers and asymmetric drop path rates between teacher and student to enhance optimization friendliness.
Evaluate several distillation targets (full feature map vs. logits) and find that full feature maps yield the best gains.
Analyze attention properties (average attention distance, head diversity, attention maps) and loss landscapes to diagnose optimization friendliness.
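
The distillation objective described in the list above (a 1x1 convolution on the student features, whitening of the teacher features, and a smooth L1 loss between the two) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the tiny example tensors, the use of a non-parametric layer norm as the whitening step, and the value `beta=2.0` are all assumptions made for this sketch. A 1x1 convolution is modeled as the same linear map applied to every token.

```python
import math

def whiten(vec, eps=1e-6):
    """Normalize one token's features to zero mean and unit variance
    (a layer norm without learnable scale or bias; assumed whitening step)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def project(vec, weight, bias):
    """Per-token linear map; a 1x1 convolution applies this at every position."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def smooth_l1(pred, target, beta=2.0):
    """Mean smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff <= beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

def fd_loss(student_tokens, teacher_tokens, weight, bias, beta=2.0):
    """Average the per-token loss between projected student features
    and whitened teacher features."""
    losses = [
        smooth_l1(project(s, weight, bias), whiten(t), beta)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(losses) / len(losses)

# Hypothetical toy data: 2 tokens, student dim 2, teacher dim 3.
teacher = [[1.0, 2.0, 3.0], [4.0, 0.0, -1.0]]
student = [[0.5, -0.2], [0.1, 0.9]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # maps dim 2 -> dim 3
bias = [0.0, 0.0, 0.0]

loss = fd_loss(student, teacher, weight, bias)
print(f"FD loss: {loss:.4f}")
```

In a real training setup the teacher would be frozen and only the student and the projection would receive gradients; a tensor library would replace the explicit loops.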

Experimental Results

Research Questions

RQ1: Can feature distillation improve the fine-tuning performance of pre-trained models across diverse pre-training paradigms (DINO, EsViT, CLIP, DeiT, MAE)?
RQ2: Does distilling features (instead of logits) yield better transfer, and how do normalization and position-encoding choices affect performance?
RQ3: What are the optimization-friendly properties responsible for FD gains, and how do they relate to attention patterns and loss landscapes?
RQ4: How close can non-MIM methods come to MIM performance in fine-tuning after FD?
RQ5: Do the gains generalize to large-scale models and downstream tasks like semantic segmentation and object detection?

Key Findings

Method	Backbone	Reso.	F.D.	IN-1K f.t. (%)	IN-1K linear (%)	ADE20K (mIoU)
BEiT	ViT-B	224²	-	83.2	37.6	47.1
MAE	ViT-B	224²	-	83.6	68.0	48.1
SimMIM	ViT-B	224²	-	83.8	56.7	47.6
SimMIM	Swin-B	224²	-	84.8	24.8	48.3
WiSE-FT CLIP	ViT-L	336²	-	87.1	-	-
DINO	ViT-B	224²	-	82.8	78.2	46.2
FD-DINO	ViT-B	224²	✓	83.8 (+1.0)	76.1	47.7 (+1.5)
EsViT	Swin-B	224²	-	83.9	81.3	47.3
FD-EsViT	Swin-B	224²	✓	85.1 (+1.2)	80.4	48.9 (+1.6)
DeiT	ViT-B	224²	-	81.8	-	47.0
FD-DeiT	ViT-B	224²	✓	83.0 (+1.2)	-	48.0 (+1.0)
CLIP	ViT-B	224²	-	82.9	79.5	49.5
FD-CLIP	ViT-B	224²	✓	84.9 (+2.0)	80.3	52.8 (+3.3)
CLIP	ViT-L	224²	-	86.1	83.5	53.5
FD-CLIP	ViT-L	224²	✓	87.7 (+1.6)	84.8	55.7 (+2.2)
FD-CLIP*	ViT-L	336²	✓	89.0	-	-

Feature distillation consistently improves ImageNet-1K fine-tuning accuracy by 1.0 to 2.0 points across DINO, EsViT, DeiT, and CLIP.
FD enables non-MIM methods (e.g., DINO, EsViT, CLIP, DeiT) to reach competitive or superior fine-tuning performance relative to MIM approaches.
CLIP ViT-L with FD reaches 89.0% top-1 accuracy on ImageNet-1K at 336×336 resolution, 1.9 points above the WiSE-FT CLIP ViT-L result (87.1%) listed in the table.
On the 3B-parameter SwinV2-G, FD improves ADE20K mIoU by +1.5 and COCO AP by +1.1, achieving 61.4 mIoU and 64.2 AP.
FD tends to create more diverse attention heads, greater reliance on relative positions, and flatter loss landscapes, all contributing to improved fine-tuning.
MAE representations gain little from FD, suggesting that MIM pre-training already yields the optimization-friendly properties that FD provides.
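
One of the attention diagnostics mentioned above, average attention distance, can be illustrated with a short sketch. It assumes the common definition in which each query patch's distances to all key patches are weighted by its attention weights and then averaged over queries. The function names, the row-major patch layout, the default `patch_size=16`, and the restriction to patch tokens (no class token) are assumptions made for this example, not details taken from the paper.

```python
import math

def patch_coords(grid_size):
    """Row-major (row, col) coordinates for a grid_size x grid_size patch grid."""
    return [(r, c) for r in range(grid_size) for c in range(grid_size)]

def average_attention_distance(attn, grid_size, patch_size=16):
    """Attention-weighted mean distance (in pixels) for one head.

    attn[i][j] is the attention weight from query patch i to key patch j;
    each row is expected to sum to 1.
    """
    coords = patch_coords(grid_size)
    total = 0.0
    for i, row in enumerate(attn):
        qr, qc = coords[i]
        for j, weight in enumerate(row):
            kr, kc = coords[j]
            # Euclidean distance between patch centers, scaled to pixels.
            dist = math.hypot(qr - kr, qc - kc) * patch_size
            total += weight * dist
    return total / len(attn)

# Hypothetical 2x2 grid (4 patches).
n = 4
local_head = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
global_head = [[1.0 / n for _ in range(n)] for _ in range(n)]

local_dist = average_attention_distance(local_head, grid_size=2)
global_dist = average_attention_distance(global_head, grid_size=2)
print(f"local head: {local_dist:.2f} px, global head: {global_dist:.2f} px")
```

A head that attends only to its own patch has distance zero, while a head that spreads attention uniformly has a larger value; comparing this quantity across heads and layers is one way to see how varied the heads are.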


This review was generated by AI and reviewed by human editors.