QUICK REVIEW

[Paper Review] Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

Yixuan Wei, Han Hu|arXiv (Cornell University)|2022. 05. 27.

Advanced Neural Network Applications | Citations: 54

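One-line summary

This paper post-processes pre-trained representations with feature distillation (FD), converting them into optimization-friendly features and narrowing the fine-tuning gap between contrastive/self-supervised methods and masked image modeling (MIM). FD yields strong fine-tuning gains across a range of models, including CLIP and SwinV2-G, and the paper analyzes the optimization properties behind the improvement.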

ABSTRACT

Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy is improved by +1.5 mIoU / +1.1 mAP to 61.4 mIoU / 64.2 mAP on ADE20K semantic segmentation and COCO object detection, respectively, creating new records on both benchmarks. More importantly, our work provides a way for the future research to focus more effort on the generality and scalability of the learnt representations without being pre-occupied with optimization friendliness since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.

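Motivation and objectives

Motivate and quantify why masked image modeling (MIM) outperforms other pre-training paradigms in fine-tuning.
Propose a generic feature distillation (FD) method that can be applied to any pre-trained model to improve its fine-tuning performance.
Identify and analyze the optimization-friendly properties that FD introduces into the representations.
Demonstrate that FD brings non-MIM methods (including contrastive and CLIP-based ones) to competitive or better fine-tuning performance.
Show practical gains across ImageNet-1K classification, ADE20K segmentation, and COCO detection.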

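Proposed method

Distill feature maps from the pre-trained teacher into a student network, using a 1x1 convolution to match feature dimensions.
Whiten the teacher feature maps to normalize their magnitudes and stabilize distillation.
Distill with a smooth L1 loss between the projected student features and the whitened teacher features.
Use a relative position bias (RPB) shared across layers and asymmetric drop path rates for the teacher and student to improve optimization friendliness.
Evaluate different distillation targets (full feature maps versus logits) and find that full feature maps bring the largest gains.
Analyze attention properties (average attention distance, head diversity, attention maps) and loss landscapes to diagnose optimization friendliness.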
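The distillation objective described above (projection, whitening, smooth L1) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: whitening is assumed to be a per-token layer normalization without learned scale or bias, the 1x1 convolution is assumed to sit on the student side and is written as a per-token linear map, and the threshold `beta=2.0` is an assumed default. The function names `whiten`, `project`, `smooth_l1`, and `fd_loss` are hypothetical.

```python
import math

def whiten(tokens, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance.

    Assumption: whitening = layer normalization without scale or bias.
    """
    out = []
    for vec in tokens:
        mean = sum(vec) / len(vec)
        var = sum((v - mean) ** 2 for v in vec) / len(vec)
        out.append([(v - mean) / math.sqrt(var + eps) for v in vec])
    return out

def project(tokens, weight, bias):
    """Apply one linear map to every token, which is what a 1x1 convolution does."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in tokens
    ]

def smooth_l1(pred, target, beta=2.0):
    """Mean smooth L1 loss: quadratic for |d| <= beta, linear beyond it."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            d = abs(p - t)
            total += 0.5 * d * d / beta if d <= beta else d - 0.5 * beta
            count += 1
    return total / count

def fd_loss(student_tokens, teacher_tokens, weight, bias, beta=2.0):
    """Smooth L1 between projected student features and whitened teacher features."""
    return smooth_l1(project(student_tokens, weight, bias), whiten(teacher_tokens), beta)

# Toy example: two tokens with two channels each (made-up numbers).
teacher = [[1.0, 3.0], [2.0, 6.0]]
student = [[-1.0, 1.0], [-1.0, 1.0]]   # already matches the whitened teacher
identity = [[1.0, 0.0], [0.0, 1.0]]    # projection that leaves features unchanged
loss = fd_loss(student, teacher, identity, [0.0, 0.0])
print(loss)  # close to zero, because the student already matches the target
```

Because the teacher is whitened before comparison, the two toy teacher tokens map to the same target even though their raw magnitudes differ by a factor of two. That is the magnitude normalization the method relies on.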

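Experimental results

Research questions

RQ1: Can feature distillation improve the fine-tuning performance of pre-trained models across pre-training paradigms (DINO, EsViT, CLIP, DeiT, MAE)?
RQ2: Does distilling features (rather than logits) transfer better? How do normalization and position-encoding choices affect performance?
RQ3: Which optimization-friendly properties account for FD's gains, and how do they relate to attention patterns and loss landscapes?
RQ4: How close can non-MIM methods get to MIM fine-tuning performance after FD?
RQ5: Do the gains generalize to large-scale models and to downstream tasks such as semantic segmentation and object detection?

Key results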

Method	Backbone	Resolution	FD	IN-1K fine-tune	IN-1K linear	ADE20K mIoU
BEiT	ViT-B	224²		83.2	37.6	47.1
MAE	ViT-B	224²		83.6	68.0	48.1
SimMIM	ViT-B	224²		83.8	56.7	47.6
SimMIM	Swin-B	224²		84.8	24.8	48.3
WiSE-FT CLIP	ViT-L	336²		87.1	-	-
DINO	ViT-B	224²		82.8	78.2	46.2
FD-DINO	ViT-B	224²	✓	83.8 (+1.0)	76.1	47.7 (+1.5)
EsViT	Swin-B	224²		83.9	81.3	47.3
FD-EsViT	Swin-B	224²	✓	85.1 (+1.2)	80.4	48.9 (+1.6)
DeiT	ViT-B	224²		81.8	-	47.0
FD-DeiT	ViT-B	224²	✓	83.0 (+1.2)	-	48.0 (+1.0)
CLIP	ViT-B	224²		82.9	79.5	49.5
FD-CLIP	ViT-B	224²	✓	84.9 (+2.0)	80.3	52.8 (+3.3)
CLIP	ViT-L	224²		86.1	83.5	53.5
FD-CLIP	ViT-L	224²	✓	87.7 (+1.6)	84.8	55.7 (+2.2)
FD-CLIP*	ViT-L	336²	✓	89.0	-	-

Feature distillation consistently improves ImageNet-1K fine-tuning accuracy by about 1.0 to 2.0 percentage points across pre-training methods.
FD brings non-MIM methods (DINO, EsViT, CLIP, DeiT) to fine-tuning performance that is competitive with or better than MIM approaches.
With FD, CLIP ViT-L reaches 89.0% top-1 accuracy on ImageNet-1K, surpassing prior CLIP fine-tuning results by up to 1.9 points.
On the 3B-parameter SwinV2-G, FD improves ADE20K mIoU by +1.5 and COCO mAP by +1.1, reaching 61.4 mIoU and 64.2 mAP.
FD tends to produce more diverse attention heads, greater reliance on relative positions, and flatter loss landscapes, all of which contribute to better fine-tuning.
MAE representations gain little extra from FD, which suggests that FD's optimization-friendly effects overlap with those MIM already provides.
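Two of the attention diagnostics mentioned in this review, average attention distance and head diversity, can be approximated as below. This is a rough sketch with assumptions of its own: attention is given as a row-stochastic matrix over patches laid out row-major on a square grid, distance is measured in patch units rather than pixels, and head diversity is proxied by the cosine similarity between two heads' flattened attention maps (lower similarity means more diverse heads). The function names are hypothetical and not taken from the paper's code.

```python
import math

def mean_attention_distance(attn, grid):
    """Attention-weighted mean distance (in patch units) for one head.

    attn[i][j] is the weight query patch i puts on key patch j; each row sums to 1.
    """
    n = grid * grid
    total = 0.0
    for i in range(n):
        yi, xi = divmod(i, grid)
        for j in range(n):
            yj, xj = divmod(j, grid)
            total += attn[i][j] * math.hypot(yi - yj, xi - xj)
    return total / n

def head_cosine_similarity(attn_a, attn_b):
    """Cosine similarity between two heads' flattened attention maps."""
    flat_a = [w for row in attn_a for w in row]
    flat_b = [w for row in attn_b for w in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return dot / (norm_a * norm_b)

# Toy 2x2 patch grid with two extreme heads (made-up attention maps).
grid = 2
local_head = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
global_head = [[0.25] * 4 for _ in range(4)]

print(mean_attention_distance(local_head, grid))   # 0.0: each patch attends only to itself
print(mean_attention_distance(global_head, grid))  # larger: attention spread over the grid
print(head_cosine_similarity(local_head, global_head))
```

A model whose heads all look like `global_head` would show high pairwise similarity and uniformly large distances, whereas a mix of local and global heads shows a wider spread on both measures.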

This review was generated by AI and checked by a human editor.