QUICK REVIEW

[Paper Review] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Renrui Zhang, Rongyao Fang | arXiv (Cornell University) | 2021-11-06

Multimodal Machine Learning Applications | References: 66 | Citations: 128

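One-line summary

Tip-Adapter builds a non-parametric two-layer MLP adapter from a few-shot cache, with no training, to augment CLIP; it achieves few-shot performance competitive with trained adapters and converges quickly when fine-tuned.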

ABSTRACT

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by using large-scale contrastive image-text pairs. It shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification. However, such a process still needs extra training and computational resources. In this paper, we propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably or even better than CLIP-Adapter. Tip-Adapter does not require any back propagation for training the adapter, but creates the weights by a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performed adapter weights without any training, which is both efficient and effective. Moreover, the performance of Tip-Adapter can be further boosted by fine-tuning such properly initialized adapter for only a few epochs with super-fast convergence speed. We conduct extensive experiments of few-shot classification on ImageNet and other 10 datasets to demonstrate the superiority of proposed Tip-Adapter. The code will be released at https://github.com/gaopengcuhk/Tip-Adapter.

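Motivation and objectives

Improve CLIP's few-shot capability without full adapter fine-tuning or prompt design.
Propose a training-free, cache-based adapter that fuses few-shot knowledge with pre-trained CLIP features.
Demonstrate competitive few-shot classification performance across a range of datasets and backbones.
Show that fine-tuning from the cache initialization further improves performance and converges quickly.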

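Proposed method

Add a two-layer MLP adapter with a residual connection to CLIP.
Build a key-value cache from the K-shot training set: keys are CLIP visual features and values are one-hot labels.
Set the adapter weights W1 and W2 directly from the cache (W1 = F_train, W2 = L_train^T), so the adapter is created without any training.
At test time, compute the logits as a combination of the cache-based prediction and the pre-trained CLIP prediction, balanced by the residual ratio alpha.
Optionally unfreeze W1 and fine-tune it for a few epochs (e.g., 20), which converges quickly and further improves performance.
Use a new activation, phi(x) = exp(-beta(1 - x)), to modulate the affinities in cache retrieval.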
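
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for CLIP image features and for the text-derived class weights, the function names (build_cache, tip_adapter_logits) are hypothetical, the alpha and beta defaults are placeholders, and no logit scaling is applied.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_cache(train_features, train_labels, num_classes):
    """Build the adapter weights from the few-shot set, with no training.

    W1 holds the keys (normalized visual features), shape (N*K, C).
    W2 holds the transposed one-hot labels, shape (N, N*K).
    """
    W1 = l2_normalize(train_features)
    one_hot = np.eye(num_classes)[np.asarray(train_labels)]
    return W1, one_hot.T


def tip_adapter_logits(test_features, W1, W2, clip_weights, alpha=1.0, beta=5.0):
    """Combine the cache prediction with the CLIP prediction.

    alpha (residual ratio) and beta (sharpness) are placeholder values here.
    """
    f = l2_normalize(test_features)
    affinity = f @ W1.T                        # cosine similarity to each cached key
    hidden = np.exp(-beta * (1.0 - affinity))  # phi(x) = exp(-beta * (1 - x))
    cache_logits = hidden @ W2.T               # sum the activations per class
    clip_logits = f @ l2_normalize(clip_weights).T
    return alpha * cache_logits + clip_logits


# Toy stand-ins for CLIP outputs: 3 classes, 2 shots each, 8-dim features.
rng = np.random.default_rng(0)
train_features = rng.normal(size=(6, 8))
train_labels = np.array([0, 0, 1, 1, 2, 2])
clip_weights = rng.normal(size=(3, 8))  # stand-in for text-encoder class weights

W1, W2 = build_cache(train_features, train_labels, num_classes=3)
logits = tip_adapter_logits(rng.normal(size=(4, 8)), W1, W2, clip_weights)
print(logits.shape)  # (4, 3)
```

With alpha = 0 the cache term vanishes and only the CLIP term remains; larger alpha gives the few-shot cache more weight relative to the pre-trained prediction.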

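Experimental results

Research questions

RQ1: Can a training-free, cache-based adapter match or surpass the few-shot classification performance of CLIP-Adapter fine-tuned with SGD?
RQ2: How does integrating a few-shot cache into CLIP change zero-shot and few-shot transfer across datasets and backbones?
RQ3: Does light fine-tuning from the cache initialization yield faster convergence and higher accuracy?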

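Key results

Without any training, Tip-Adapter achieves few-shot performance competitive with CLIP-Adapter.
Tip-Adapter-F (fine-tuned for a few epochs) outperforms all compared methods across multiple datasets and backbones.
Cache-based initialization enables fast convergence, requiring far fewer epochs than CLIP-Adapter (e.g., 20 vs. 200).
The gain from the cache grows as the number of shots increases, but diminishes when the cache size is held fixed (16 in the experiments).
The residual ratio alpha balances adaptation against CLIP's prior knowledge; in the ablations the best value is alpha ≈ 1.0.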
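
The fine-tuned variant, which starts from the cache and updates only W1, can be illustrated with a small hand-written gradient-descent loop. This is a sketch under assumptions rather than the paper's training recipe: the features are random stand-ins for CLIP outputs, the CLIP logits are set to zero as an uninformative placeholder, the function name finetune_keys is hypothetical, and the learning rate, alpha, and beta are arbitrary; only the 20-epoch count echoes the figure quoted in this review.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def finetune_keys(feats, one_hot, W1, W2, clip_logits,
                  alpha=1.0, beta=5.0, lr=0.02, epochs=20):
    """Full-batch gradient descent on the cache keys W1 only.

    W2 (one-hot values) and the CLIP logits stay frozen.
    Returns the updated keys and the cross-entropy loss at each epoch.
    """
    W1 = W1.copy()
    losses = []
    for _ in range(epochs):
        hidden = np.exp(-beta * (1.0 - feats @ W1.T))         # (B, N*K)
        probs = softmax(alpha * hidden @ W2.T + clip_logits)  # (B, N)
        loss = -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))
        losses.append(float(loss))
        # Backpropagate by hand: logits -> hidden -> affinity -> W1.
        d_logits = (probs - one_hot) / len(feats)
        d_hidden = alpha * d_logits @ W2
        d_affinity = d_hidden * hidden * beta
        W1 -= lr * d_affinity.T @ feats
    return W1, losses


# Toy few-shot set: 3 classes, 2 shots each, 8-dim unit-norm features.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
one_hot = np.eye(3)[labels]
feats = rng.normal(size=(6, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

W1_init, W2 = feats.copy(), one_hot.T  # cache initialization: keys are the features
clip_logits = np.zeros((6, 3))         # placeholder for the frozen CLIP prediction
W1_tuned, losses = finetune_keys(feats, one_hot, W1_init, W2, clip_logits)
print(losses[0], losses[-1])
```

Because the keys already start at the few-shot features, the cache prediction favors the correct class before any update, which is one way to read the fast convergence reported above.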


This review was generated by AI and reviewed by a human editor.