[Paper Explained] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
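The paper introduces feature distillation as a post-processing step for pre-trained representations, converting them into optimization-friendly features and narrowing the fine-tuning gap between contrastive/self-supervised methods and masked image modeling (MIM). This yields strong fine-tuning gains on a range of models, including CLIP and SwinV2-G, and the paper analyzes the optimization properties that drive the improvement.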
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
Research Motivation and Goals
- Motivate and quantify why masked image modeling (MIM) excels at fine-tuning compared to other pre-training paradigms.
- Propose a generic feature distillation (FD) method that can be applied to arbitrary pre-trained models to improve fine-tuning performance.
- Identify and analyze the optimization-friendly properties of representations introduced by FD.
- Demonstrate that FD brings non-MIM methods (including contrastive and CLIP-based) to competitive or superior fine-tuning performance.
- Show practical gains across ImageNet-1K classification, ADE20K segmentation, and COCO detection.
Proposed Method
- Distill feature maps from a pre-trained teacher into a student network using a 1x1 convolution to align dimensions.
- Whiten the teacher feature maps to normalize magnitudes and improve distillation stability.
- Use a smooth L1 loss between the transformed student features and whitened teacher features for distillation.
- Employ shared relative position bias (RPB) across layers and asymmetric drop path rates between teacher and student to enhance optimization friendliness.
- Evaluate various distillation targets (full feature map vs logits) and find full feature maps yield best gains.
- Analyze attention properties (average attention distance, head diversity, attention maps) and loss landscapes to diagnose optimization friendliness.
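The distillation objective described in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`whiten`, `project`, `smooth_l1`, `fd_loss`) are hypothetical, whitening is assumed here to be a per-token normalization to zero mean and unit variance with no learned scale or bias, and the `beta` default is a placeholder rather than a value taken from the text. It uses plain Python lists so it runs without dependencies; a real training loop would use a tensor library.

```python
import math

def whiten(token, eps=1e-6):
    """Normalize one teacher feature vector to zero mean and unit variance.

    Assumption: whitening is a per-token normalization without learned
    scale or bias.
    """
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

def project(token, weight, bias):
    """A 1x1 convolution acts on each token independently, i.e. a linear map
    from the student dimension to the teacher dimension."""
    return [sum(w * x for w, x in zip(row, token)) + b
            for row, b in zip(weight, bias)]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 averaged over elements: quadratic below beta, linear above.

    The beta default is a placeholder, not a value from the summary.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def fd_loss(student_tokens, teacher_tokens, weight, bias, beta=1.0):
    """Mean distillation loss over all tokens of one image."""
    losses = [
        smooth_l1(project(s, weight, bias), whiten(t), beta)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(losses) / len(losses)

# Toy usage: two tokens, student dimension 2, teacher dimension 3.
student = [[0.5, -1.0], [2.0, 0.0]]
teacher = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 x 2 projection
bias = [0.0, 0.0, 0.0]
print(fd_loss(student, teacher, weight, bias))
```

Normalizing the teacher features puts every target on the same scale, so the loss is not dominated by a teacher whose features happen to have large magnitudes. Only the student and the projection would receive gradients; the teacher stays frozen.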
Experimental Results
Research Questions
- RQ1: Can feature distillation improve the fine-tuning performance of pre-trained models across diverse pre-training paradigms (DINO, EsViT, CLIP, DeiT, MAE)?
- RQ2: Does distilling features (instead of logits) yield better transfer, and how do normalization and position-encoding choices affect performance?
- RQ3: What are the optimization-friendly properties responsible for FD gains, and how do they relate to attention patterns and loss landscapes?
- RQ4: How close can non-MIM methods come to MIM performance in fine-tuning after FD?
- RQ5: Do the gains generalize to large-scale models and downstream tasks like semantic segmentation and object detection?
Main Findings
| Method | Backbone | Resolution | FD | IN-1K fine-tuning (%) | IN-1K linear (%) | ADE20K mIoU |
|---|---|---|---|---|---|---|
| BEiT | ViT-B | 224² | | 83.2 | 37.6 | 47.1 |
| MAE | ViT-B | 224² | | 83.6 | 68.0 | 48.1 |
| SimMIM | ViT-B | 224² | | 83.8 | 56.7 | 47.6 |
| SimMIM | Swin-B | 224² | | 84.8 | 24.8 | 48.3 |
| WiSE-FT CLIP | ViT-L | 336² | | 87.1 | - | - |
| DINO | ViT-B | 224² | | 82.8 | 78.2 | 46.2 |
| FD-DINO | ViT-B | 224² | ✓ | 83.8 (+1.0) | 76.1 | 47.7 (+1.5) |
| EsViT | Swin-B | 224² | | 83.9 | 81.3 | 47.3 |
| FD-EsViT | Swin-B | 224² | ✓ | 85.1 (+1.2) | 80.4 | 48.9 (+1.6) |
| DeiT | ViT-B | 224² | | 81.8 | - | 47.0 |
| FD-DeiT | ViT-B | 224² | ✓ | 83.0 (+1.2) | - | 48.0 (+1.0) |
| CLIP | ViT-B | 224² | | 82.9 | 79.5 | 49.5 |
| FD-CLIP | ViT-B | 224² | ✓ | 84.9 (+2.0) | 80.3 | 52.8 (+3.3) |
| CLIP | ViT-L | 224² | | 86.1 | 83.5 | 53.5 |
| FD-CLIP | ViT-L | 224² | ✓ | 87.7 (+1.6) | 84.8 | 55.7 (+2.2) |
| FD-CLIP* | ViT-L | 336² | ✓ | 89.0 | - | - |
- Feature distillation consistently improves ImageNet-1K fine-tuning by roughly 1.0%–2.0% across several pre-training methods.
- FD enables non-MIM methods (e.g., DINO, EsViT, CLIP, DeiT) to reach competitive or superior fine-tuning performance relative to MIM approaches.
- CLIP ViT-L with FD reaches 89.0% top-1 accuracy on ImageNet-1K, 1.9 points above the WiSE-FT CLIP result (87.1%) listed in the table.
- On the 3B-parameter SwinV2-G, FD improves ADE20K mIoU by +1.5 and COCO AP by +1.1, achieving 61.4 mIoU and 64.2 AP.
- FD tends to create more diverse attention heads, greater reliance on relative positions, and flatter loss landscapes, all contributing to improved fine-tuning.
- MAE representations show limited extra gains from FD, indicating overlapping optimization-friendly effects with MIM.
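Average attention distance, one of the attention diagnostics listed under the proposed method, can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's measurement code: the function name `average_attention_distance` is hypothetical, distances are measured in patch units on a square grid, and the class token is assumed to have been removed from the attention matrix.

```python
import math

def average_attention_distance(attn, grid_size):
    """Attention-weighted mean distance (in patch units) for one head.

    attn[i][j] is the attention weight from query token i to key token j.
    Tokens are laid out row-major on a grid_size x grid_size grid, and each
    row of attn is assumed to sum to 1.
    """
    n = grid_size * grid_size
    coords = [(i // grid_size, i % grid_size) for i in range(n)]
    total = 0.0
    for i, (qy, qx) in enumerate(coords):
        for j, (ky, kx) in enumerate(coords):
            # Euclidean distance between the query and key patch positions.
            total += attn[i][j] * math.hypot(qy - ky, qx - kx)
    return total / n

# Toy usage on a 2 x 2 grid (4 tokens).
local_head = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]
print(average_attention_distance(local_head, 2))
print(average_attention_distance(global_head, 2))
```

A head that attends only to its own patch scores zero, while a head that spreads attention uniformly scores the mean pairwise patch distance. Comparing this value across heads and layers gives a rough picture of how varied the heads' receptive ranges are.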
This explainer was generated by AI and reviewed by human editors.