QUICK REVIEW

[Paper Review] CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Peng Gao, Shijie Geng|arXiv (Cornell University)|2021. 10. 09.

Multimodal Machine Learning Applications | References: 48 | Citations: 111

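One-line summary

CLIP-Adapter fine-tunes vision-language models by inserting lightweight feature adapters with residual blending, offering a simple and effective alternative to prompt tuning for few-shot tasks.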

ABSTRACT

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced by CLIP (Radford et al., 2021) to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization (CoOp; Zhou et al., 2021) has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach. Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.

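Research motivation and goals

Motivates improving vision-language models beyond prompt tuning in the open-vocabulary CLIP setting.
Proposes lightweight bottleneck feature adapters for fine-tuning CLIP while keeping the backbone frozen.
Enables residual-style blending that combines zero-shot pre-trained knowledge with newly learned few-shot knowledge.
Demonstrates effectiveness across eleven datasets and a range of few-shot settings, supported by ablation studies.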

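Proposed method

Adds small bottleneck adapters, each built from two linear layers, to CLIP's image branch, text branch, or both.
Freezes the original CLIP backbone and trains only the adapters on few-shot data.
Blends the adapted features with the original features through a residual connection, controlled by the residual ratios α (image branch) and β (text branch).
Forms the classifier weights from the original W and adjusts them through a parallel adapter, complemented by the residual mix.
Optionally learns α and β through a hypernetwork for dataset-specific tuning.
Explores three variants: image-only adapter, text-only adapter, and both adapters; the image adapter is the default.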
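
The image-branch variant described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the frozen CLIP encoders are not modeled (random vectors stand in for the image feature and the classifier weights W), the layers are bias-free, and the helper names, toy sizes, and the temperature of 0.01 are all hypothetical choices made for the example.

```python
import math
import random

def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def relu(x):
    return [max(0.0, v) for v in x]

def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]

def adapter(f, w_down, w_up):
    """Bottleneck adapter: project D -> D/r, apply ReLU, project D/r -> D."""
    return linear(relu(linear(f, w_down)), w_up)

def blend(f, adapted, ratio):
    """Residual blending: ratio * adapted + (1 - ratio) * original."""
    return [ratio * a + (1.0 - ratio) * o for a, o in zip(adapted, f)]

def classify(image_feat, class_weights, w_down, w_up, alpha, temperature=0.01):
    """Softmax over cosine similarities between the blended feature and each class weight."""
    f_star = l2_normalize(blend(image_feat, adapter(image_feat, w_down, w_up), alpha))
    logits = [
        sum(a * b for a, b in zip(f_star, l2_normalize(w))) / temperature
        for w in class_weights
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
D, R, K = 8, 4, 3  # toy feature size, reduction factor, number of classes (assumed)
w_down = rand_matrix(D, D // R)   # trainable in the real method
w_up = rand_matrix(D // R, D)     # trainable in the real method
image_feat = [random.gauss(0.0, 1.0) for _ in range(D)]  # stand-in for a frozen CLIP image feature
class_weights = rand_matrix(K, D)                        # stand-in for text-derived weights W

probs_adapted = classify(image_feat, class_weights, w_down, w_up, alpha=0.2)
probs_zero_shot = classify(image_feat, class_weights, w_down, w_up, alpha=0.0)
print(probs_adapted, probs_zero_shot)
```

In training, only `w_down` and `w_up` would receive gradients; setting `alpha=0.0` bypasses the adapter, so the prediction depends on the original feature alone.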

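Experimental results

Research questions

RQ1: Can fine-tuning with lightweight feature adapters match or exceed prompt tuning methods on few-shot vision-language classification?
RQ2: Do the residual connection and bottleneck design reduce overfitting and improve generalization across datasets?
RQ3: Which configuration suits different dataset characteristics best (which branch to adapt, bottleneck size, residual ratio, and so on)?
RQ4: Can learnable residual ratios further improve performance across datasets?
RQ5: Compared with prompt-based methods, how do adapters affect the learned feature manifold?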

Key results

Bottleneck dimension	DTD (%)	ImageNet (%)
D	65.03	59.78
D/2	65.62	60.03
D/4	66.06	61.33
D/8	64.93	60.06
D/16	63.75	60.02
D/32	63.50	59.45
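
To give the bottleneck sizes in this table a sense of scale, the following sketch counts the trainable weights of one adapter at each setting. It assumes two bias-free linear layers (D to D/r, then D/r back to D) and D = 1024, the image embedding size of CLIP's ResNet-50 variant; both are assumptions for illustration, since the review itself does not state them, and `adapter_param_count` is a hypothetical helper.

```python
def adapter_param_count(feature_dim, reduction):
    """Weights in two bias-free linear layers: D -> D/r and D/r -> D."""
    hidden = feature_dim // reduction
    return 2 * feature_dim * hidden

FEATURE_DIM = 1024  # assumption: CLIP ResNet-50 image embedding size

counts = {r: adapter_param_count(FEATURE_DIM, r) for r in (1, 2, 4, 8, 16, 32)}
for reduction, count in counts.items():
    label = "D" if reduction == 1 else f"D/{reduction}"
    print(f"{label:>5}: {count:,} trainable weights")
```

Under these assumptions the count halves with each step down the table, from about 2.1 million weights at D to about 66 thousand at D/32.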

CLIP-Adapter outperforms zero-shot CLIP, linear-probe CLIP, and CoOp on 11 datasets across a range of few-shot settings.
Residual blending with bottleneck adapters yields strong generalization, especially in very low-shot regimes (1–2 shots).
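Fine-tuning the image branch (visual adapter) generally yields larger gains than text-only adaptation, and combining both adapters is not always better.
The best bottleneck dimension is about D/4, where D is the original feature dimension; in the table above, both larger and smaller bottlenecks score lower on DTD and ImageNet.
Residual ratio α trends: fine-grained datasets favor a higher α (0.6), while generic datasets favor a lower α (≈0.2); α = 0 recovers zero-shot CLIP, and α = 1 overfits.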
Variants with learnable α, β via a hypernetwork can achieve competitive results without manual tuning.
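
The two limiting cases of α mentioned above follow directly from the residual blend. The notation below is assumed for illustration: f is the original image feature, A_v the visual adapter, W the original classifier weights, and A_t the text adapter.

```latex
\begin{aligned}
f^{\star} &= \alpha \, A_v(f) + (1 - \alpha) \, f, \\
W^{\star} &= \beta \, A_t(W) + (1 - \beta) \, W, \\
\alpha = 0 &\;\Rightarrow\; f^{\star} = f \quad \text{(original feature only, i.e. zero-shot CLIP)}, \\
\alpha = 1 &\;\Rightarrow\; f^{\star} = A_v(f) \quad \text{(adapted feature only, no pre-trained residual)}.
\end{aligned}
```

Intermediate values trade off pre-trained knowledge against what the adapter learns from the few-shot examples.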


This review was generated by AI and checked by a human editor.