QUICK REVIEW

[Paper Review] Visual-Language Prompt Tuning with Knowledge-guided Context Optimization

Hantao Yao, Rui Zhang|arXiv (Cornell University)|Mar 23, 2023

Multimodal Machine Learning Applications|Cited by 17

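ONE-SENTENCE SUMMARY

KgCoOp extends CoOp with a regularizer that pulls the learnable prompt toward the hand-crafted prompt, improving CLIP's generalization to unseen classes while keeping training fast.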

ABSTRACT

Prompt tuning is an effective way to adapt the pre-trained visual-language model (VLM) to the downstream task using task-related textual tokens. Representative CoOp-based work combines the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, the specific textual knowledge is the worse generalization to the unseen classes because it forgets the essential general textual knowledge having a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. The key insight of KgCoOp is that forgetting about essential knowledge can be alleviated by reducing the discrepancy between the learnable prompt and the hand-crafted prompt. Especially, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding the KgCoOp upon the contrastive loss can make a discriminative prompt for both seen and unseen tasks. Extensive evaluation of several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, i.e., achieves better performance with less training time.

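MOTIVATION AND OBJECTIVES

Motivates the need for better generalization when prompt-tuning pre-trained visual-language models.
Introduces a regularization term that aligns the learned prompt with general hand-crafted prompts so that general knowledge is retained.
Shows that minimizing the prompt discrepancy improves performance on unseen classes without hurting accuracy on seen classes.
Evaluates KgCoOp in base-to-new, few-shot, and domain generalization settings across 11 datasets and multiple backbones.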

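PROPOSED METHOD

Builds on CoOp by adding a knowledge-guided context optimization term.
Defines general textual knowledge as the knowledge carried by hand-crafted prompts (e.g., CLIP's "a photo of a [Class]").
Defines specific textual knowledge as the knowledge carried by learnable prompts trained on few-shot data.
Introduces L_kg = (1/N_c) sum_i ||w_i − w_i_clip||^2, where w_i and w_i_clip are the class-i text embeddings from the learned and hand-crafted prompts, to minimize the discrepancy between the learned and general embeddings.
Optimizes the total loss L = L_ce + lambda * L_kg, where lambda balances the two terms.
Shows that KgCoOp, with the same training time as CoOp, reaches nearly the same base-class performance and higher unseen-class performance.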
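
The two loss terms above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, the fixed `temperature` value, and the use of plain dot-product logits are all hypothetical choices made for the example. Only the formulas L_kg and L = L_ce + lambda * L_kg come from the review.

```python
import numpy as np

def kg_loss(w_learned, w_clip):
    """L_kg: mean over classes of the squared Euclidean distance
    between learned and hand-crafted text embeddings.
    Both arrays are assumed to have shape (N_c, D)."""
    diff = w_learned - w_clip
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def cross_entropy(logits, labels):
    """Numerically stable softmax cross-entropy, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def kgcoop_loss(image_feats, w_learned, w_clip, labels, lam=8.0, temperature=0.01):
    """Total loss L = L_ce + lam * L_kg.
    Assumption: logits are image-text dot products divided by a fixed
    temperature; a real CLIP pipeline normalizes features and uses its
    own learned scale."""
    logits = image_feats @ w_learned.T / temperature
    return cross_entropy(logits, labels) + lam * kg_loss(w_learned, w_clip)

# Toy usage with made-up 2-class, 2-dimensional embeddings.
w_clip = np.array([[1.0, 0.0], [0.0, 1.0]])      # hand-crafted prompt embeddings
w_learned = np.array([[0.9, 0.1], [0.2, 0.8]])   # learnable prompt embeddings
image_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
total = kgcoop_loss(image_feats, w_learned, w_clip, labels)
```

With `lam=0` the sketch reduces to a CoOp-style cross-entropy objective; increasing `lam` penalizes drift of the learned embeddings away from the hand-crafted ones more strongly.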

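EXPERIMENTAL RESULTS

Research Questions

RQ1: Does enforcing closeness between the learnable prompt and the general prompt improve generalization to unseen classes without hurting seen-class performance?
RQ2: How does KgCoOp perform in base-to-new, few-shot, and domain generalization scenarios across diverse datasets?
RQ3: How strongly does the regularization weight lambda affect generalization and harmonic-mean performance?
RQ4: Can KgCoOp be applied on top of existing CoOp-based methods to improve their generalization to unseen classes?

Key Findings

KgCoOp achieves a higher harmonic mean (H) than CoOp, CoCoOp, and ProGrad in base-to-new generalization.
KgCoOp has higher new-class accuracy than competing CoOp-based methods, with base-class performance close to that of CoCoOp.
KgCoOp's training time is on par with CoOp and shorter than that of CoCoOp and ProGrad.
KgCoOp improves domain generalization, with higher average target performance than CoCoOp on ImageNet-derived targets.
In the few-shot setting, KgCoOp outperforms the baselines on average across datasets and improves unseen-class performance.
The regularization parameter lambda governs the trade-off; the best value (e.g., 8.0) yields the highest harmonic mean in the 4-shot/16-shot scenarios.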
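
For readers unfamiliar with the H metric used in the findings above, the following sketch shows how a harmonic mean of base-class and new-class accuracy is typically computed. It assumes H is the standard harmonic mean of the two accuracies; the function name and the input numbers are made up for illustration and are not results from the paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy.
    Returns 0.0 when both accuracies are zero to avoid dividing by zero."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Hypothetical accuracies (in percent), chosen only to show the behavior:
# a large gap between base and new accuracy pulls H toward the lower value.
balanced = harmonic_mean(70.0, 70.0)
skewed = harmonic_mean(80.0, 60.0)
```

Both hypothetical pairs have the same arithmetic mean, yet the skewed pair scores lower, which is why H rewards methods that keep base-class accuracy while raising new-class accuracy.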

This review was generated by AI and checked by a human editor.