QUICK REVIEW

[Paper Review] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

Manli Shu, Weili Nie|arXiv (Cornell University)|Sep 15, 2022

Multimodal Machine Learning Applications | Cited by 112

One-Line Summary

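This paper introduces Test-Time Prompt Tuning (TPT), which improves the zero-shot generalization of vision-language models such as CLIP by optimizing the prompt on a single test sample, using entropy minimization across augmented views together with confidence-based filtering.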

ABSTRACT

Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using the training data from downstream tasks. While effective, training on domain-specific data reduces a model's generalization capability to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/TPT.

Motivation and Objectives

Enhance zero-shot generalization of CLIP without additional training data or annotations.
Develop a test-time objective that aligns predictions across augmented views of a single test image.
Introduce confidence selection to remove noisy augmentations during prompt tuning.
Demonstrate TPT on image classification under distribution shifts and on context-dependent visual reasoning.
Show that TPT matches or exceeds state-of-the-art prompt tuning that uses training data in various settings.

Proposed Method

Represent the prompt as a learnable text embedding and optimize it at test time.
Generate N augmented views of the test image and minimize the marginal entropy of predictions across views.
Apply confidence selection by discarding augmented views with high self-entropy based on a percentile threshold.
For Bongard-HOI visual reasoning, learn both the prompt and binary label tokens from support images without using query annotations.
Build on CLIP and update only the text prompt, leaving the rest of the model unchanged to preserve its zero-shot abilities.
Run one optimization step on the prompt with AdamW for each single test example.
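
The objective described in the list above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `keep_fraction` parameter and its value, and the toy logits are all assumptions made for this example. The gradient step that would actually update the prompt embedding is omitted, since it needs CLIP and an autodiff library.

```python
import math

def softmax(logits):
    """Turn one view's class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confident_views(view_logits, keep_fraction):
    """Confidence selection: keep the views with the lowest self-entropy.

    keep_fraction is a hypothetical name for the percentile cutoff.
    """
    probs = [softmax(logits) for logits in view_logits]
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]))
    n_keep = max(1, math.ceil(keep_fraction * len(probs)))
    return sorted(ranked[:n_keep])

def tpt_objective(view_logits, keep_fraction=0.5):
    """Entropy of the prediction averaged over the selected views."""
    kept = confident_views(view_logits, keep_fraction)
    probs = [softmax(view_logits[i]) for i in kept]
    n_classes = len(probs[0])
    marginal = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return entropy(marginal), kept

# Toy logits for four augmented views of one image over three classes (made up).
view_logits = [
    [4.0, 0.5, 0.2],  # confident view, favors class 0
    [3.5, 0.3, 0.4],  # confident view, favors class 0
    [0.4, 0.5, 0.3],  # near-uniform view, likely a noisy augmentation
    [0.2, 0.1, 0.3],  # near-uniform view, likely a noisy augmentation
]

loss, kept = tpt_objective(view_logits, keep_fraction=0.5)
print("views kept:", kept)
print("marginal entropy over kept views:", round(loss, 4))
```

In a real setup the logits would come from CLIP's image-text similarities for each augmented view, and the returned loss would be backpropagated into the prompt embedding only.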

Experimental Results

Research Questions

RQ1: Can test-time prompt tuning improve zero-shot CLIP performance under natural distribution shifts without any training data?
RQ2: How does TPT compare to few-shot prompt tuning methods on cross-dataset generalization and unseen categories?
RQ3: Can TPT be effectively extended to context-dependent visual reasoning tasks like Bongard-HOI without training data?
RQ4: What is the impact of confidence-based view selection on prompt-tuning effectiveness?

Key Findings

Method	ImageNet	ImageNet-A	ImageNet-V2	ImageNet-R	ImageNet-Sketch	Average	OOD Average
CLIP-RN50	58.16	21.83	51.41	56.15	33.37	44.18	40.69
Ensemble	59.81	23.24	52.91	60.72	35.48	46.43	43.09
CoOp	63.33	23.06	55.40	56.60	34.67	46.61	42.43
CoCoOp	62.81	23.32	55.72	57.74	34.48	46.81	42.82
TPT	60.74	26.67	54.70	59.11	35.09	47.26	43.89
TPT + CoOp	64.73	30.32	57.83	58.99	35.86	49.55	45.75
TPT + CoCoOp	62.93	27.40	56.60	59.88	35.43	48.45	44.83
CLIP-ViT-B/16	66.73	47.87	60.86	73.98	46.09	59.11	57.20
Ensemble	68.34	49.89	61.88	77.65	48.24	61.20	59.42
CoOp	71.51	49.71	64.20	75.21	47.99	61.72	59.28
CoCoOp	71.02	50.63	64.07	76.18	48.75	62.13	59.91
TPT	68.98	54.77	63.45	77.06	47.94	62.44	60.81
TPT + CoOp	73.61	57.95	66.83	77.27	49.29	64.99	62.83
TPT + CoCoOp	71.07	58.47	64.85	78.65	48.47	64.30	62.61
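
The last two columns of the table can be reproduced from the per-dataset accuracies. The sketch below assumes that the average column is the mean over all five datasets and that the OOD average is the mean over the four ImageNet variants (everything except ImageNet itself). That reading is inferred from the numbers, not stated in the review, and the row label `TPT (ViT-B/16)` is introduced here only to tell the two backbone groups apart.

```python
# Accuracies copied from three rows of the table, in column order:
# ImageNet, ImageNet-A, ImageNet-V2, ImageNet-R, ImageNet-Sketch.
rows = {
    "CLIP-RN50": [58.16, 21.83, 51.41, 56.15, 33.37],
    "CLIP-ViT-B/16": [66.73, 47.87, 60.86, 73.98, 46.09],
    "TPT (ViT-B/16)": [68.98, 54.77, 63.45, 77.06, 47.94],
}

# (average, OOD average) as reported in the same rows of the table.
reported = {
    "CLIP-RN50": (44.18, 40.69),
    "CLIP-ViT-B/16": (59.11, 57.20),
    "TPT (ViT-B/16)": (62.44, 60.81),
}

def column_averages(accs):
    """Assumed definition: mean of all five, and mean of the last four."""
    average = sum(accs) / len(accs)
    ood = accs[1:]  # drop the in-distribution ImageNet column
    return average, sum(ood) / len(ood)

for name, accs in rows.items():
    average, ood_average = column_averages(accs)
    rep_avg, rep_ood = reported[name]
    # Allow for rounding to two decimals in the published table.
    matches = abs(average - rep_avg) < 0.01 and abs(ood_average - rep_ood) < 0.01
    print(f"{name}: average={average:.3f}, OOD average={ood_average:.3f}, matches table: {matches}")
```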

With CLIP-ViT-B/16, TPT raises the OOD average under natural distribution shifts from 57.20 to 60.81, a gain of about 3.6 points in zero-shot top-1 accuracy over hand-crafted prompts.
TPT matches or surpasses state-of-the-art prompt tuning methods that require downstream training data in several settings.
The largest gain is on ImageNet-A, where TPT reaches 54.77 against 47.87 for CLIP-ViT-B/16 with hand-crafted prompts, an improvement of 6.9 points.
In cross-dataset generalization, TPT attains on-par performance with few-shot methods without using training data.
For Bongard-HOI visual reasoning, TPT outperforms the state-of-the-art by 4.1%.
Confidence selection helps suppress noisy augmentations and boosts entropy minimization efficacy.
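
The headline gains quoted in this list can be checked against the table. The sketch below assumes they refer to the CLIP-ViT-B/16 and TPT rows of the second backbone group, which is an inference from the numbers and not something the review states explicitly.

```python
# Out-of-distribution accuracies copied from the CLIP-ViT-B/16 and TPT rows.
baseline = {"ImageNet-A": 47.87, "ImageNet-V2": 60.86,
            "ImageNet-R": 73.98, "ImageNet-Sketch": 46.09}
tpt = {"ImageNet-A": 54.77, "ImageNet-V2": 63.45,
       "ImageNet-R": 77.06, "ImageNet-Sketch": 47.94}

# Per-dataset gain in accuracy points, and the mean gain across the four sets.
gains = {name: tpt[name] - baseline[name] for name in baseline}
mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
print(f"mean gain: +{mean_gain:.2f}, largest gain on {largest}")
```

Note that these are differences in accuracy points rather than relative percentages.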

This review was generated by AI and checked by a human editor.