QUICK REVIEW

[Paper Review] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Yixuan Wei, Han Hu | arXiv (Cornell University) | May 27, 2022
Advanced Neural Network Applications | Cited by 54
One-sentence summary

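The paper introduces feature distillation as a post-processing step for pre-trained representations, converting them into optimization-friendly features and narrowing the fine-tuning gap between contrastive/self-supervised methods and masked image modeling (MIM). This yields strong fine-tuning gains across a range of models, including CLIP and SwinV2-G, and the paper analyzes the optimization properties that drive the improvement.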

ABSTRACT

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.

Motivation and objectives

  • Explain and quantify why masked image modeling (MIM) fine-tunes better than other pre-training paradigms.
  • Propose a generic feature distillation (FD) method that can be applied to arbitrary pre-trained models to improve fine-tuning performance.
  • Identify and analyze the optimization-friendly properties of representations introduced by FD.
  • Demonstrate that FD brings non-MIM methods (including contrastive and CLIP-based) to competitive or superior fine-tuning performance.
  • Show practical gains across ImageNet-1K classification, ADE20K segmentation, and COCO detection.

Proposed method

  • Distill feature maps from a pre-trained teacher into a student network using a 1x1 convolution to align dimensions.
  • Whiten the teacher feature maps to normalize magnitudes and improve distillation stability.
  • Use a smooth L1 loss between the transformed student features and whitened teacher features for distillation.
  • Employ shared relative position bias (RPB) across layers and asymmetric drop path rates between teacher and student to enhance optimization friendliness.
  • Evaluate several distillation targets (full feature map vs. logits) and find that full feature maps yield the best gains.
  • Analyze attention properties (average attention distance, head diversity, attention maps) and loss landscapes to diagnose optimization friendliness.
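
The distillation objective described in the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes whitening is a per-token normalization over channels with no learned scale or bias, it treats the 1x1 convolution as a per-token linear map, and the names `whiten`, `project`, `smooth_l1`, `fd_loss`, and the default `beta=1.0` are hypothetical choices made for this example.

```python
import math

def whiten(token, eps=1e-6):
    """Normalize one token's channels to zero mean and unit variance.

    Assumption: whitening is approximated by a normalization without
    learned scale or bias; the summary does not name the exact operator.
    """
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def project(token, weight):
    """Apply a 1x1 convolution to one token, i.e. a linear map over channels.

    weight has shape (teacher_channels, student_channels).
    """
    return [sum(w * x for w, x in zip(row, token)) for row in weight]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for errors below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def fd_loss(student_tokens, teacher_tokens, weight, beta=1.0):
    """Average smooth L1 between projected student and whitened teacher tokens."""
    losses = [
        smooth_l1(project(s, weight), whiten(t), beta)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(losses) / len(losses)

# Toy feature maps: 2 tokens, student has 2 channels, teacher has 3.
teacher = [[1.0, 2.0, 3.0], [0.0, 4.0, 2.0]]
student = [[0.5, -0.5], [1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical projection weights
print(fd_loss(student, teacher, weight))
```

In a real setup the teacher is frozen and only the student and the projection receive gradients; the shared relative position bias and asymmetric drop path rates belong to the network definitions and are not modeled here.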

Experimental results

Research questions

  • RQ1: Can feature distillation improve the fine-tuning performance of pre-trained models across diverse pre-training paradigms (DINO, EsViT, CLIP, DeiT, MAE)?
  • RQ2: Does distilling features (instead of logits) yield better transfer, and how do normalization and position-encoding choices affect performance?
  • RQ3: What are the optimization-friendly properties responsible for FD gains, and how do they relate to attention patterns and loss landscapes?
  • RQ4: How close can non-MIM methods come to MIM performance in fine-tuning after FD?
  • RQ5: Do the gains generalize to large-scale models and downstream tasks like semantic segmentation and object detection?

Key findings

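| Method | Backbone | Resolution | IN-1K f.t. (top-1 %) | IN-1K linear (top-1 %) | ADE20K (mIoU) |
|---|---|---|---|---|---|
| BEiT | ViT-B | 224² | 83.2 | 37.6 | 47.1 |
| MAE | ViT-B | 224² | 83.6 | 68.0 | 48.1 |
| SimMIM | ViT-B | 224² | 83.8 | 56.7 | 47.6 |
| SimMIM | Swin-B | 224² | 84.8 | 24.8 | 48.3 |
| WiSE-FT CLIP | ViT-L | 336² | 87.1 | - | - |
| DINO | ViT-B | 224² | 82.8 | 78.2 | 46.2 |
| FD-DINO | ViT-B | 224² | 83.8 (+1.0) | 76.1 | 47.7 (+1.5) |
| EsViT | Swin-B | 224² | 83.9 | 81.3 | 47.3 |
| FD-EsViT | Swin-B | 224² | 85.1 (+1.2) | 80.4 | 48.9 (+1.6) |
| DeiT | ViT-B | 224² | 81.8 | - | 47.0 |
| FD-DeiT | ViT-B | 224² | 83.0 (+1.2) | - | 48.0 (+1.0) |
| CLIP | ViT-B | 224² | 82.9 | 79.5 | 49.5 |
| FD-CLIP | ViT-B | 224² | 84.9 (+2.0) | 80.3 | 52.8 (+3.3) |
| CLIP | ViT-L | 224² | 86.1 | 83.5 | 53.5 |
| FD-CLIP | ViT-L | 224² | 87.7 (+1.6) | 84.8 | 55.7 (+2.2) |
| FD-CLIP* | ViT-L | 336² | 89.0 | - | - |
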
  • Feature distillation consistently improves ImageNet-1K fine-tuning accuracy by roughly 1.0 to 2.0 percentage points (top-1) across DINO, EsViT, DeiT, and CLIP.
  • FD enables non-MIM methods (e.g., DINO, EsViT, CLIP, DeiT) to reach competitive or superior fine-tuning performance relative to MIM approaches.
  • CLIP ViT-L with FD reaches 89.0% top-1 accuracy on ImageNet-1K, 1.9 points above the 87.1% of WiSE-FT CLIP, the prior CLIP fine-tuning result listed in the table.
  • On the 3B-parameter SwinV2-G, FD improves ADE20K mIoU by +1.5 and COCO AP by +1.1, achieving 61.4 mIoU and 64.2 AP.
  • FD tends to create more diverse attention heads, greater reliance on relative positions, and flatter loss landscapes, all contributing to improved fine-tuning.
  • MAE representations gain little extra from FD, suggesting that MIM pre-training already supplies much of the optimization friendliness that FD adds.
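
One of the attention diagnostics mentioned above, average attention distance, can be illustrated with a short sketch. It follows the commonly used definition (the attention-weighted mean spatial distance between a query patch and the patches it attends to) and assumes a single head, a row-normalized attention matrix over patch tokens only (no class token), and distances measured in patch units. The function name `average_attention_distance` is hypothetical, and the paper's exact computation may differ.

```python
import math

def average_attention_distance(attn, grid_h, grid_w):
    """Attention-weighted mean distance between query and key patches.

    attn[q][k] is the attention weight from query patch q to key patch k;
    each row is assumed to sum to 1. Patches are indexed row-major on a
    grid_h x grid_w grid, and distance is Euclidean in patch units.
    """
    n = grid_h * grid_w
    coords = [(i // grid_w, i % grid_w) for i in range(n)]
    total = 0.0
    for q in range(n):
        qy, qx = coords[q]
        for k in range(n):
            ky, kx = coords[k]
            total += attn[q][k] * math.hypot(qy - ky, qx - kx)
    return total / n

# Two extreme heads on a 2x2 patch grid.
local_head = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]

print(average_attention_distance(local_head, 2, 2))   # purely local head
print(average_attention_distance(global_head, 2, 2))  # uniformly global head
```

Comparing this value across the heads of a layer gives a rough picture of head diversity: heads that all produce similar distances attend at similar ranges, while a wide spread indicates a mix of local and global heads.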

This review was generated by AI and reviewed by human editors.