QUICK REVIEW

[Paper Review] Visual Prompting via Image Inpainting

Amir Bar, Yossi Gandelsman|arXiv (Cornell University)|Sep 1, 2022

Multimodal Machine Learning Applications | Citations: 50

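One-line summary

This paper shows that by casting vision tasks as grid-image inpainting with an MAE-VQGAN trained on a large unlabeled dataset of figures, a single model can perform a variety of image-to-image tasks through test-time visual prompting, without any fine-tuning.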

ABSTRACT

How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting - literally just filling in a hole in a concatenated visual prompt image - turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated - 88k unlabeled figures from academic papers sources on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.

Motivation and objectives

Motivate a prompt-based approach for adapting pre-trained vision models to downstream tasks without fine-tuning or architectural changes.
Propose framing downstream tasks as image inpainting on a visual prompt grid that contains examples and a query.
Create a large unlabeled figures dataset to train robust inpainting models for prompting tasks.
Evaluate prompting on multiple vision tasks to assess generalization and task coverage.
Investigate how prompting design choices affect performance and how data distribution impacts results.

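Proposed method

Build MAE-VQGAN by combining masked auto-encoding (MAE) with a VQGAN codebook, yielding an inpainting model that predicts visual tokens for the masked regions.
Train MAE-VQGAN on the curated Computer Vision Figures dataset (88k unlabeled figures from arXiv), as well as on ImageNet data, so that it learns representations suited to grid-style inpainting.
Form a visual prompt by combining one or more input-output task examples and a new query image into a single grid image, then masking the region to be inpainted.
Define a simple, hard-coded function g that constructs the visual prompt x_vp from the examples S and the query x_q. Inpainting fills the masked region to produce the target output.
Optionally apply prompt ensembling, which generates several prompts and averages their predictions, to improve robustness.
Analyze prompting design choices such as layout, color, and masking style, and demonstrate that adding more examples and ensembling improves performance.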
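
The prompt construction and ensembling steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2x2 layout (example pair on top, query and hole on the bottom), the function names `make_visual_prompt` and `ensemble_predictions`, and the `inpaint_fn` callable are all hypothetical, and `copy_example_inpainter` is a toy stand-in for a trained MAE-VQGAN.

```python
import numpy as np


def make_visual_prompt(example_in, example_out, query, mask_value=0.0):
    """Hypothetical g(S, x_q) for a single example pair.

    Assumed layout:  [ example_in | example_out ]
                     [ query      | masked hole ]
    Returns the grid image and a boolean mask that is True over the hole.
    """
    h, w, c = query.shape
    for img in (example_in, example_out):
        if img.shape != query.shape:
            raise ValueError("all grid cells must share the same shape")
    hole = np.full((h, w, c), mask_value, dtype=query.dtype)
    top = np.concatenate([example_in, example_out], axis=1)
    bottom = np.concatenate([query, hole], axis=1)
    grid = np.concatenate([top, bottom], axis=0)
    mask = np.zeros((2 * h, 2 * w), dtype=bool)
    mask[h:, w:] = True
    return grid, mask


def ensemble_predictions(inpaint_fn, examples, query):
    """Build one prompt per example pair and average the inpainted cells."""
    h, w, _ = query.shape
    outputs = []
    for example_in, example_out in examples:
        grid, mask = make_visual_prompt(example_in, example_out, query)
        filled = inpaint_fn(grid, mask)  # assumed to return a full grid image
        outputs.append(filled[h:, w:].astype(np.float64))
    return np.mean(outputs, axis=0)


def copy_example_inpainter(grid, mask):
    """Toy stand-in for a real model: copies the top-right cell into the hole."""
    h, w = grid.shape[0] // 2, grid.shape[1] // 2
    out = grid.copy()
    out[h:, w:] = grid[:h, w:]
    return out
```

With a real inpainting model in place of `copy_example_inpainter`, the bottom-right cell of the filled grid would be read off as the task output for the query image.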

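Experimental results

Research questions

RQ1: Can a single pre-trained vision model adapt to multiple downstream image-to-image tasks through test-time visual prompting, without fine-tuning?
RQ2: When trained on task-agnostic grid data, can image inpainting serve as a viable core mechanism for visual prompting?
RQ3: How does the choice of training data (Figures dataset vs. ImageNet) affect prompting performance across tasks?
RQ4: How do prompt design choices affect the quality and robustness of prompting?
RQ5: On segmentation and detection tasks, how does visual prompting compare with conventional fine-tuning and few-shot baselines?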

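Key findings

MAE-VQGAN trained on the Figures dataset performs well on foreground segmentation and single object detection when given visual prompts.
Prompting models trained on Figures substantially outperform models pre-trained on ImageNet only, and also surpass several non-specialized baselines across multiple tasks.
Adding more examples to the visual prompt generally improves segmentation mIoU and detection accuracy on the Pascal-5i and Pascal VOC datasets.
Prompt design choices (vertical vs. horizontal layout, black vs. white masks, and so on) affect prompting quality, and certain layouts increase attention on the relevant regions.
Prompt ensembling (averaging the predictions from multiple prompts) further improves results and stabilizes performance across tasks.
MAE-VQGAN trained on Figures produces sharper completions and better task performance than the VQGAN and BEiT baselines, especially on detection and segmentation.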
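
For readers unfamiliar with the segmentation metric mentioned in these findings, the following is a minimal sketch of intersection-over-union averaged over a set of binary masks. It is a simplified illustration, not the exact evaluation protocol of the Pascal-5i benchmark; the function names and the convention for two empty masks are assumptions made here.

```python
import numpy as np


def binary_iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Average the per-image IoU over paired predictions and targets."""
    return float(np.mean([binary_iou(p, t) for p, t in zip(preds, targets)]))
```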


This review was generated by AI and checked by a human editor.