QUICK REVIEW

[Paper Review] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Size Wu, Wenwei Zhang | arXiv (Cornell University) | 2023-10-02

Multimodal Machine Learning Applications | Citations: 12

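One-Line Summary

CLIPSelf fine-tunes CLIP Vision Transformers through self-distillation on dense features, aligning region representations with image-level representations without any region-text pairs, which improves open-vocabulary object detection and segmentation.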

ABSTRACT

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.

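Motivation and Objectives

Motivate open-vocabulary dense prediction tasks (detection and segmentation) and analyze region-language alignment in CLIP ViTs.
Investigate why the dense features of CLIP ViTs perform poorly at region recognition, and propose a region-to-image self-distillation solution.
Develop CLIPSelf to align region representations from the dense feature map with the image-level representations of the corresponding image crops.
Demonstrate state-of-the-art results on OV-COCO, OV-LVIS, and open-vocabulary segmentation benchmarks.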

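Proposed Method

Extract the dense feature map by removing the self-attention of the final ViT block, producing a spatial feature map from which region features are taken.
Split the image into a random m x n grid and use the resulting patches as the regions for self-distillation.
Keep a frozen teacher CLIP ViT and fine-tune a student CLIP ViT, training it to maximize the cosine similarity between the teacher's image-level representation of each patch and the student's embedding of the corresponding region.
Compute region embeddings by pooling (RoIAlign) over the dense feature map, and align them with the teacher's image representations using a cosine-similarity loss.
Update all attention layers of the ViT to maximize region-image alignment; a larger student input size further improves region recognition.
Apply the fine-tuned CLIP ViTs to open-vocabulary detection (a two-stage detector on a frozen backbone), semantic segmentation (Cat-Seg initialization), and panoptic segmentation (the ODISE inference stage).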
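
The self-distillation steps listed above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `max_cells` bound, and the random placeholder arrays are all assumptions made for the example, and plain mean pooling stands in for RoIAlign.

```python
import numpy as np


def random_grid_boxes(height, width, rng, max_cells=6):
    """Split an H x W map into a random m x n grid of (y0, y1, x0, x1) boxes.

    max_cells is an illustrative upper bound on m and n, not a value from this review.
    """
    m = int(rng.integers(1, max_cells + 1))  # upper bound is exclusive
    n = int(rng.integers(1, max_cells + 1))
    ys = np.linspace(0, height, m + 1).round().astype(int)
    xs = np.linspace(0, width, n + 1).round().astype(int)
    return [(ys[i], ys[i + 1], xs[j], xs[j + 1]) for i in range(m) for j in range(n)]


def pool_regions(dense, boxes):
    """Mean-pool a (H, W, C) dense map inside each box (simple stand-in for RoIAlign)."""
    channels = dense.shape[-1]
    return np.stack(
        [dense[y0:y1, x0:x1].reshape(-1, channels).mean(axis=0) for y0, y1, x0, x1 in boxes]
    )


def cosine_distill_loss(student_regions, teacher_crops, eps=1e-8):
    """Mean of (1 - cosine similarity) between matching student and teacher embeddings."""
    s = student_regions / (np.linalg.norm(student_regions, axis=-1, keepdims=True) + eps)
    t = teacher_crops / (np.linalg.norm(teacher_crops, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))


rng = np.random.default_rng(0)
dense = rng.normal(size=(16, 16, 8))  # placeholder for the student's dense feature map
boxes = random_grid_boxes(16, 16, rng)
student_regions = pool_regions(dense, boxes)
# Placeholder for the frozen teacher's image-level embedding of each cropped patch.
teacher_crops = rng.normal(size=student_regions.shape)
loss = cosine_distill_loss(student_regions, teacher_crops)
```

In an actual training loop the teacher embeddings would come from running the frozen CLIP ViT on each cropped patch, and the loss would be backpropagated through the student only. Minimizing this loss is equivalent to maximizing the cosine similarity described in the list.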

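Experimental Results

Research Questions

RQ1: How well do ViT-based CLIP models align region representations with language in open-vocabulary dense tasks?
RQ2: Can self-distillation from image-level CLIP representations improve dense region representations without region-text pairs?
RQ3: Does aligning dense region embeddings with image crops improve open-vocabulary object detection and segmentation performance?
RQ4: Is CLIPSelf robust across ViT sizes and training data (e.g., CC3M), and is it compatible with window-attention variants?
RQ5: What are the relative merits of using region proposals versus patch-based regions?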

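Key Findings

The dense representations of CLIP ViTs underperform image crops at region-level recognition, which motivates region-to-image alignment.
CLIPSelf, a self-distillation method using random m x n image patches, significantly improves region and panoptic mask classification accuracy over the baseline CLIP ViT.
Using the teacher's image representations as supervision, the student ViT is trained to produce region embeddings that align with the corresponding image crops, improving open-vocabulary detection and segmentation performance.
Replacing the backbone with CLIPSelf-enhanced ViTs achieves state-of-the-art open-vocabulary object detection on OV-COCO and OV-LVIS and improves results on open-vocabulary semantic and panoptic segmentation benchmarks.
CLIPSelf outperforms approaches based on region-text pairs (which rely on noisy region-text matching), is effective with local window-attention variants, and remains effective when trained on CC3M.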
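
This review does not spell out how region classification accuracy is measured. A common CLIP-style protocol, assumed here purely for illustration, assigns each region the category whose text embedding has the highest cosine similarity with the region embedding and then reports Top-1 accuracy. The function names and the toy embeddings below are hypothetical.

```python
import numpy as np


def classify_regions(region_embeds, text_embeds):
    """Return, for each region, the index of the most cosine-similar text embedding."""
    r = region_embeds / np.linalg.norm(region_embeds, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return np.argmax(r @ t.T, axis=-1)


def top1_accuracy(predictions, labels):
    """Fraction of regions whose predicted category matches the ground-truth label."""
    return float(np.mean(np.asarray(predictions) == np.asarray(labels)))


# Toy data: three category text embeddings and three region embeddings.
text_embeds = np.eye(3)
region_embeds = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 2.0]])
predictions = classify_regions(region_embeds, text_embeds)
accuracy = top1_accuracy(predictions, [0, 1, 1])
```

Under this assumed protocol, better region-language alignment shows up directly as a higher share of regions matched to the correct category name.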

This review was generated by AI and reviewed by a human editor.