QUICK REVIEW

[Paper Review] Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

Renrui Zhang, Rongyao Fang|arXiv (Cornell University)|Nov 6, 2021

Multimodal Machine Learning Applications | References: 66 | Citations: 128

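One-line summary

Tip-Adapter extends CLIP with a training-free, non-parametric two-layer MLP adapter built from a few-shot cache. Its few-shot performance is competitive with training-based adapters, and it converges quickly when fine-tuned.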

ABSTRACT

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations by using large-scale contrastive image-text pairs. It shows impressive performance on zero-shot knowledge transfer to downstream tasks. To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification. However, such a process still needs extra training and computational resources. In this paper, we propose Training-Free CLIP-Adapter (Tip-Adapter), which not only inherits CLIP's training-free advantage but also performs comparably or even better than CLIP-Adapter. Tip-Adapter does not require any back propagation for training the adapter, but creates the weights by a key-value cache model constructed from the few-shot training set. In this non-parametric manner, Tip-Adapter acquires well-performed adapter weights without any training, which is both efficient and effective. Moreover, the performance of Tip-Adapter can be further boosted by fine-tuning such properly initialized adapter for only a few epochs with super-fast convergence speed. We conduct extensive experiments of few-shot classification on ImageNet and other 10 datasets to demonstrate the superiority of proposed Tip-Adapter. The code will be released at https://github.com/gaopengcuhk/Tip-Adapter.

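Motivation and objectives

Improve CLIP's few-shot capability without full adapter fine-tuning or prompt engineering.
Propose a training-free, cache-based adapter that fuses few-shot knowledge with the features of pre-trained CLIP.
Demonstrate competitive few-shot classification performance across diverse datasets and backbones.
Show that fine-tuning from the cache-initialized state speeds up convergence and further improves performance.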

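Proposed method

Add a two-layer MLP adapter with a residual connection to CLIP.
Build a key-value cache from the K-shot training set: the keys are CLIP visual features and the values are one-hot labels.
Set the adapter weights W1 and W2 directly from the cache (W1 = F_train, W2 = L_train^T), so no training is required.
Compute test-time logits as a combination of the cache-propagated prediction and the pre-trained CLIP prediction, balanced by the residual ratio alpha.
Optionally unfreeze W1 and fine-tune it for a few epochs (e.g., 20) to further improve performance with fast convergence.
Use a new activation, phi(x) = exp(-beta(1 - x)), to modulate the similarities used for cache retrieval.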
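The training-free steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for CLIP image and text features, the function names (`l2_normalize`, `build_cache`, `tip_adapter_logits`) are hypothetical, the default `beta` is an arbitrary illustrative value, and any logit scaling a real implementation may apply is omitted. The cache values are stored with shape (N*K, N), so the second layer is applied as a right multiplication.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_cache(train_feats, train_labels, num_classes):
    """Keys: normalized few-shot features (N*K, D). Values: one-hot labels (N*K, N)."""
    keys = l2_normalize(train_feats)
    values = np.eye(num_classes)[train_labels]
    return keys, values

def tip_adapter_logits(test_feats, keys, values, text_weights, alpha=1.0, beta=5.5):
    """Blend the cache prediction with the zero-shot prediction via alpha."""
    f = l2_normalize(test_feats)
    affinity = f @ keys.T                           # first layer, W1 = keys
    activated = np.exp(-beta * (1.0 - affinity))    # phi(x) = exp(-beta(1 - x))
    cache_logits = activated @ values               # second layer, built from one-hot labels
    clip_logits = f @ l2_normalize(text_weights).T  # zero-shot classifier (stand-in)
    return alpha * cache_logits + clip_logits

# Synthetic stand-ins for CLIP features (assumption: 3 classes, 4 shots, 8-dim features).
rng = np.random.default_rng(0)
num_classes, shots, dim = 3, 4, 8
train_labels = np.repeat(np.arange(num_classes), shots)
train_feats = rng.normal(size=(num_classes * shots, dim))
text_weights = rng.normal(size=(num_classes, dim))

keys, values = build_cache(train_feats, train_labels, num_classes)
logits = tip_adapter_logits(train_feats[:2], keys, values, text_weights)
print(logits.shape)  # (2, 3)
```

Setting `alpha=0.0` recovers the zero-shot stand-in prediction alone, which makes the residual role of the cache term easy to check. In the fine-tuned variant described above, `keys` would become a learnable parameter initialized from the cache; that step is not shown here.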

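Experimental results

Research questions

RQ1: Can a training-free, cache-based adapter match or exceed the few-shot classification performance of CLIP-Adapter fine-tuned with SGD?
RQ2: How does integrating CLIP with a few-shot cache affect zero-shot and few-shot transfer across diverse datasets and backbones?
RQ3: Does a small amount of fine-tuning from the cache-initialized state yield faster convergence and higher accuracy?

Key findings

Tip-Adapter achieves few-shot performance competitive with CLIP-Adapter without any training.
Tip-Adapter-F (with a few epochs of fine-tuning) outperforms all compared methods across multiple datasets and backbones.
Cache-based initialization enables fast convergence, requiring far fewer epochs than CLIP-Adapter (e.g., 20 vs. 200).
The performance gain from the cache grows with the number of shots, but the gain diminishes when the cache size is held fixed (16 in the experiments).
The residual ratio alpha balances adaptation against prior CLIP knowledge; in the ablation study the optimal value is close to alpha ≈ 1.0.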


This review was generated by AI and checked by a human editor.